Hear about Eventbrite's system for streamlining their AWS multi-account strategy using a vending machine-like setup with TLZ and Terraform Enterprise.
Maddie Cousens: Hello everyone. And thank you for welcoming us into your virtual stream of HashiConf Digital 2020. My name is Maddie, and I'm joined by my colleague, Luca. We are on the Site Reliability Engineering team at Eventbrite, the largest event technology platform.
Today, our goal is to give you an authentic picture of how we use the AWS Terraform Landing Zone at Eventbrite to streamline our multi-AWS account strategy — and the work we did to take it from source code to production. We hope you leave this talk with a better understanding of what this accelerator is, whether it could be a solution for your team, and an idea of the work involved to get it up and running.
We want to start this talk by looking at the problem space. Why is Eventbrite moving to a multi-AWS account infrastructure? What were our requirements of this solution and how do we arrive at using the AWS Terraform Landing Zone? We think it would then be helpful to give you a zoomed-out picture of the account vending machine in action, through the eyes of a requesting client — before diving into the work we did to adapt the Terraform Landing Zone to fit our needs.
Let's start with a little background on Eventbrite's journey. Eventbrite's story isn't different from many companies who have grown from a small startup with 30 co-located engineers to 300 plus engineers across multiple time zones.
As many of you know, with this growth comes a bit of a change in engineering culture and architecture. We started with a monolith code base and a small Ops team keeping the lights on. In the past couple of years, we have focused on the very necessary transition to where we are today — highlighted in red.
Like many companies, we knew a monolithic architecture was inhibiting our growth as an engineering team. And we started breaking up our monolith into a microservice-oriented architecture to enjoy benefits such as independent deployments, clear ownership, improvements in system reliability, and a better separation of concerns.
We also sped up development by transitioning from an Ops team to an SRE organization and offloaded some infrastructure ownership to developers. However, as we looked to the future, we realized we had still not achieved the developer efficiency we desired nor resolved common reliability issues. Despite breaking the monolith into microservices, the underlying infrastructure was built in the same AWS account and too tightly coupled to have clear ownership, efficient development, and reliable systems.
We decided to move to a multi-AWS account strategy and decouple our systems into isolated domains. In our definition, domains are fully isolated segments of our business within an event-driven architecture that has both compute and data layer independence.
By segmenting into domains, we address two of our goals: having global engineering teams running production systems without the bottleneck of SRE, and an increase in system reliability. In a domains world, developers can operate production systems. They no longer need to understand how the entire platform works and can move quicker with fewer dependencies. Domains are also not susceptible to instability encountered in other domains.
We have this shiny bright future of domains, but how do we get there? Before we can even start thinking of development, we must first build the infrastructure for managing, maintaining, and evolving a multi-AWS account solution. What are our requirements of this multi-AWS account infrastructure?
Our first category of requirements is governance. As mentioned, we want to use AWS accounts to establish the high walls of isolation between domains. Secondly, it's imperative to us that SRE owns the networking components between domains and the shared infrastructure. This requirement is also about delivering an easy-to-use solution for our engineers, where they can immediately start developing — and some of the harder aspects of connecting to a shared infrastructure are abstracted away.
Regarding security control and compliance, we want to be able to enforce a set of security policies across all accounts and control the services used and actions taken via policies.
Finally, we get to automation. We don't have the bandwidth to set up these domains by hand, and manual configuration also leads to a lack of uniformity and compliance. Thus, it is imperative that the creation of domains is fully-automated with minimum human toil.
It was essential to our team that everything was done as infrastructure as code. This is not just referencing our infrastructure within the AWS platform, but also integrations with third-party providers.
As we started our search for an existing multi-AWS account solution, we first stumbled across the AWS Control Tower. The Control Tower is a pure AWS service meant to ease governance and security for multi-account organizations.
However, as we started a proof of concept, we quickly realized that the amount of manual configuration and, at the time, the lack of integration with our SAML provider, Okta, were blockers. But as is often the case, AWS has more than one solution up their sleeves. We then looked at the automated solution, the AWS Landing Zone.
AWS Landing Zone seemed to be a more mature solution than the Control Tower. However, it still had similar limitations. It is closed to the AWS ecosystem, has no out-of-the-box integration with Okta, and CloudFormation is the only infra as code option.
At this point, we were a bit stuck as to where to turn next. Luckily, our friends at HashiCorp knew of a new project that the AWS Professional Services org was working on called the AWS Terraform Landing Zone or TLZ.
TLZ emerged from various requests within the industry to have a Terraform-based AWS Landing Zone. We were fortunate to be granted early contributor access to the TLZ codebase.
Let's recap what the AWS Terraform Landing Zone Accelerator is, for those who did not get the chance to see Brad present last year. The word accelerator is important here as it's not a product, but more of an automation pattern developed by the AWS Professional Services org team. The simplest way we can explain it is that you can think of TLZ as three parts.
This accelerator aims to codify security and compliance best practices by orchestrating the provisioning (or what TLZ calls baselining) of application AWS accounts and core accounts used for things like logging, security, [shared services](https://www.hashicorp.com/resources/cloud-operating-model-devops-security-networking-challenges), and networking.
The AVM puts all the automation pieces together into the concept of vending an account. The AVM consists of a DynamoDB table that stores account request information and triggers a series of about 10 Lambda functions. These Lambda functions automate everything from account creation and baselining to setting up what we call the Vended Account Ecosystem, with integrations like Terraform Enterprise, VCS, and SAML providers.
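As a rough illustration of the trigger described above, here is a minimal Python sketch of a Lambda-style handler that reads new account requests from a DynamoDB Streams event and lists the vending steps to run for each one. The step names and the `account_name` attribute are assumptions made for this example; the talk does not list the actual AVM Lambda functions or the table schema.

```python
from typing import Any

# Hypothetical step names; the real AVM has about 10 Lambda functions
# whose names are not given in the talk.
VENDING_STEPS = [
    "create_account",
    "baseline_account",
    "setup_tfe_workspaces",
    "setup_vcs_repository",
    "setup_saml_applications",
]


def extract_requests(event: dict[str, Any]) -> list[dict[str, str]]:
    """Return the string attributes of every newly inserted table item."""
    requests = []
    for record in event.get("Records", []):
        # Only new rows represent new account requests.
        if record.get("eventName") != "INSERT":
            continue
        image = record.get("dynamodb", {}).get("NewImage", {})
        # Stream images use DynamoDB's typed JSON, e.g. {"S": "value"}.
        requests.append(
            {key: value["S"] for key, value in image.items() if "S" in value}
        )
    return requests


def handler(event: dict[str, Any], context: Any = None) -> list[tuple[str, str]]:
    """Build the ordered (account, step) plan for each account request."""
    plan = []
    for request in extract_requests(event):
        account = request.get("account_name", "<unknown>")  # assumed attribute
        for step in VENDING_STEPS:
            plan.append((account, step))
    return plan
```

In a real deployment each step would invoke its own function rather than return a plan; the sketch only shows how one table insert can fan out into an ordered series of automation steps.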
To make the AVM magic happen, TLZ leverages essential components of Terraform Enterprise or Terraform Cloud, such as enterprise workspaces, the API, and the private module registry. Without these, it would be very hard to achieve seamless automation or for us to offload infrastructure management to developers in a safe way.
We've given you some background on why Eventbrite is moving to a multi-AWS account strategy and how we arrived at the choice to use the AWS TLZ accelerator. We now want to focus on what it was like to take the TLZ codebase and adapt it to our company's needs.
However, before taking a look behind the scenes at these adaptations, we think it would first help to give a vision of the account vending machine in action through a requesting client's eyes and what the resulting vended account ecosystem looks like.
The first step of the process is the account request. One of the first things we added to TLZ was a Terraform-based account request procedure via pull requests. This change allowed for input field validation and for all request history to be checked into version control.
We like to think that PRs are the new ticket. I am now a developer, and I want a new account for my domain that handles event listing pages. I create this pull request that you can see here. Not surprisingly, it's an easy-to-use Terraform module because I love my SRE team, and they build amazing, easy-to-use interfaces. These five fields are all I need to provide. My PR is approved, I merge it, and a terraform plan is kicked off in Terraform Enterprise. At this point, I go and make a coffee.
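To make the idea of input field validation on an account request concrete, here is a small Python sketch. The talk does not name the five fields, so the field names, the allowed environments, and the validation rules below are all placeholders for illustration, not the actual module interface.

```python
import re
from dataclasses import dataclass

# Assumed set of environments; the talk mentions dev and pre-prod accounts.
ALLOWED_ENVIRONMENTS = {"dev", "preprod", "prod"}


@dataclass(frozen=True)
class AccountRequest:
    # All five field names are hypothetical.
    domain: str
    environment: str
    owner_team: str
    owner_email: str
    cost_center: str


def validate(request: AccountRequest) -> list[str]:
    """Return human-readable problems; an empty list means the request is valid."""
    errors = []
    if not re.fullmatch(r"[a-z][a-z0-9-]{2,30}", request.domain):
        errors.append("domain must be 3-31 lowercase letters, digits, or hyphens")
    if request.environment not in ALLOWED_ENVIRONMENTS:
        errors.append(f"environment must be one of {sorted(ALLOWED_ENVIRONMENTS)}")
    if not request.owner_team.strip():
        errors.append("owner_team must not be empty")
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", request.owner_email):
        errors.append("owner_email must look like an email address")
    if not request.cost_center.strip():
        errors.append("cost_center must not be empty")
    return errors
```

Running checks like these at plan time means a malformed request fails in the pull request, before an SRE is asked to apply anything.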
In the account vending process, we have one point of human intervention before all the automation begins. Members of the SRE team are the only ones with the privilege to apply changes within the account requests enterprise workspace. An SRE clicks the button — and Voilà — applied successfully. At this point, the account vending machine is kicked off. After 5-10 minutes of making coffee, I come back to my desk, and what do I now have?
In my SAML provider dashboard — which is Okta — I now see two applications. One for accessing our Terraform Enterprise instance and another for accessing the console of the newly vended AWS account.
I also see a new GitHub repository is created and assigned to my team to hold infrastructure as code. If I've already vended an application account for this domain — say another dev account or pre-prod account — this repository will have already existed because it holds the code for infrastructure in all environments.
This is an example of a change that we introduced. Originally, each environment for a domain had a separate repository. However, this didn't support our vision for future deployment pipelines within Terraform Enterprise. Thus, each team is vended one repo for their infrastructure and can decide what strategy they'll use to maintain their code across multiple environments — whose states are managed by separate enterprise workspaces.
Last but not least, as a developer on the event listing domain team, upon accessing my TFE instance, I now have access to two workspaces. I have read access to the baseline enterprise workspace and write access to the infra enterprise workspace. My infra workspace is linked to the repository I was just vended. Of note, this repository already has the Terraform code to import the remote state of the baseline workspace.
The infra workspace is also provisioned during vending time with all the necessary variables to create resources — whether this is infrastructure in AWS or third-party integrations for things like monitoring and error tracking. The core concept of baselining and a baseline workspace is something we will go into further detail later on. I'll now pass it over to my colleague, Luca. Thank you all for joining HashiConf Digital 2020, and enjoy the rest of your day.
Luca Valtulina: Hi, everyone. Welcome to HashiConf Digital 2020. In the second part of this talk, we want to take a closer look at the work we've put into adapting TLZ to our needs, focusing on some specific areas that we feel will be familiar to anyone approaching a similar solution.
First, we want to make clear that TLZ is not meant to be a plug-and-play product, but rather an accelerator. This means adopting this solution will require some effort. Which in our case meant having two engineers — yours truly here today — working almost full-time on this project for three months.
It's worth mentioning, though, in our case, we entered this project as early contributors. We were able to fork the first version of the TLZ source code that was still far away from the GA status that will hopefully be released soon. Still, we would like to show some of the work that we think is necessary to successfully bootstrap TLZ and adapt functionalities to your company needs.
A lot of work goes into adapting the AVM to meet your own needs. As shown already, we added the account requesting procedure — the pull request. We also had to build some automation steps that were required but were missing from the initial codebase. For instance, to interact with third-party providers. This was done by modifying existing Lambdas and by adding some new ones.
TLZ intentionally comes without a core network to give the possibility to either deploy your own network or attach to an existing one. In our case, as we were moving to an event-driven architecture, we dedicated quite some time to deploying our core transit network.
Services that are shared across vended accounts are another core element of TLZ. We've named a few services already that are essential to bootstrap TLZ, like Terraform Enterprise and the VCS provider. But they are not the only ones, of course. Given that each company has its own stack, the setup of these services is part of the duties of the team bootstrapping and managing TLZ.
Once automation, shared services, and shared infrastructure are in place, a set of services and policies that baseline the vended account must be defined. This set includes integration with the shared infra and security guardrails, both from AWS code guidelines and custom to your needs. And though the set is applied at vending time, it can be dynamically changed at a later stage.
We understand that the concept of baselining can be confusing. Therefore, we will dedicate some time to show you this process in more detail. At Eventbrite, we decided to modify the AVM in order to use a unified baseline for all vended accounts. Though this set must be pre-defined, we can expect it to be subject to changes. We're going to see in a second how TLZ allows us to apply these changes in a seamless way to existing accounts as well.
But let's first start with a brand-new account. Here we can see one that was created by the AVM. First of all, the TLZ codebase comes with a default baseline that covers some of the hardest security challenges within a multi-AWS account infrastructure. Among these are security guardrails and a backend of AWS services for shipping audit logs to a dedicated core account owned by our security team. These resources are provided to the account via service control policies at an organizational level and as resources in the baseline workspace.
As mentioned already, networking is an area where work needs to be done. We specifically have taken the chance to build an event-driven architecture and attach it to the TLZ core network. This includes adding some network resources to the baseline to fit our internal IP address spacing model and to attach the vended account to the underlying shared infrastructure.
We've also polished and adapted access management for the vended accounts by better scoping dedicated roles to our internal structure and defining some policies that are enforced across the organization. This is one of the areas where Terraform Enterprise shines. Access keys can be stored sensitively in the enterprise workspace, removing the need to create and maintain scoped access tokens throughout the organization. In our case, accounts also required a dedicated internal domain. Furthermore, DNS forwarders towards our legacy infra and the event message bus are also shared across the whole organization and added in the baseline process.
Here we have our baseline vended account (well, somebody's version of it at least), but we don't want to bore you with this any further. In our baseline, we have the services required to meet our security compliance standard and to abstract away from our developers the harder parts of a multi-AWS account infrastructure, like connectivity to the core network and DNS.
As mentioned, the same baseline process is used for all accounts. This is done primarily so account baselines don’t drift too far apart from each other — giving us more control when changes need to be applied to either a shared service or the core component. TLZ is once again the tool making this really easy. We can point the baseline repository to all of our accounts’ baseline enterprise workspaces; this is done at vending time.
Here we can see two vended accounts with the same baseline but different workloads. Account number one is a bit old school, with an application load balancer and some EC2 instances, while account number two went for a serverless-based infra.
Going back to the baseline process later down the road, we decided to give accounts private certificate management capabilities — and specifically to add this as part of our baseline. By simply applying this change to our unified baseline, the change was made available to all application accounts without impacting the deployed workload.
Needless to say, the application accounts vended from this point onward will use the latest available baseline, which includes private certs management. As an extra in our codebase, we have also built the possibility to have a dedicated baseline for a single account. This is done by simply pointing an account baseline enterprise workspace to a branch within the baseline repository — just a bit of Terraform Enterprise magic.
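The unified baseline with an optional per-account branch can be pictured with a short Python sketch. The repository identifier, the default branch name, and the workspace naming scheme are assumptions for illustration; the talk only states that every baseline workspace points at the baseline repository and that a single account can be pointed at a branch instead.

```python
DEFAULT_BASELINE_BRANCH = "main"  # assumed default branch name


def baseline_branch(account: str, overrides: dict[str, str]) -> str:
    """Use the unified baseline unless this account has a dedicated branch."""
    return overrides.get(account, DEFAULT_BASELINE_BRANCH)


def workspace_settings(
    accounts: list[str],
    overrides: dict[str, str],
    repo: str = "example-org/tlz-baseline",  # hypothetical repository
) -> dict[str, dict[str, str]]:
    """Map each account's baseline workspace to the repo and branch it tracks."""
    return {
        f"{account}-baseline": {
            "repo": repo,
            "branch": baseline_branch(account, overrides),
        }
        for account in accounts
    }


settings = workspace_settings(
    ["listings-dev", "orders-dev"],
    overrides={"orders-dev": "feature/private-certs"},
)
```

Because every workspace without an override tracks the same branch, a change merged there reaches all existing accounts, while the override lets one account try a change first.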
Given what we've shown so far, we hope you saw that we are seeking autonomy with constraints, which is a way to empower teams through the golden path approach. At Eventbrite, we define the golden path as the way we handle technology and architectural guidance. The overall goal is to provide teams with agency while also ensuring we have a performant and relatively consistent architecture.
The idea of the golden path is not to stifle innovation. Technologies that are on the path have been vetted by us to ensure that they will run well. We understand how and when they fail, and we have the expertise needed to use them — rather than just being something that we launch and then we're stuck with. If we truly have a gap in our technology stack that needs filling, adding something to the golden path will be a relatively quick process.
In the context of TLZ, some of the aspects of the golden path are managed via a set of guidance policies, defined as code and enforced at three different levels. AWS services and actions can be enabled or disabled via service control policies inherited by all created accounts from the AWS organization root (the so-called master payer) and the account’s organizational unit.
Access to services is managed via custom IAM policies, attached to IAM roles within an account during the baseline process. And at last, Sentinel policies are added to both baseline and infra enterprise workspaces and used to give a soft warning to those teams diverting from the golden path.
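As an illustration of the first level, here is a Python sketch that builds a service control policy document denying every action outside an allow list of services. The allow list shown is an assumption for the example, not Eventbrite's actual golden path, and the deny-with-`NotAction` pattern is one common way to write such a policy rather than the one confirmed in the talk.

```python
import json


def allow_list_scp(allowed_services: list[str]) -> dict:
    """Build an SCP that denies all actions except those of the listed services."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyServicesOffGoldenPath",
                "Effect": "Deny",
                # Everything NOT matched here is denied.
                "NotAction": [f"{service}:*" for service in sorted(allowed_services)],
                "Resource": "*",
            }
        ],
    }


# Hypothetical golden-path services, for illustration only.
policy = allow_list_scp(["s3", "lambda", "dynamodb", "ec2"])
policy_json = json.dumps(policy, indent=2)
```

A service control policy only sets the outer boundary: the IAM policies attached during baselining still have to grant access, and Sentinel policies then warn on what is planned in each workspace.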
Now let's take a moment and go back to see how we've done with the requirements we defined initially.
For governance, if you remember, we wanted to use AWS accounts to establish the high walls of isolation between domains and our SRE team owning network components and shared infrastructure. We have shown today how we have achieved account governance, thanks to the Account Vending Machine.
While infrastructure governance was not a big part of this presentation, I've mentioned that we have spent quite some time creating a dedicated network as part of our new event-driven infrastructure.
For security, we were required to be able to enforce policies across all accounts to control services used and actions taken. We've shown how security guardrails are defined in the account baseline, while technology and architectural guidance are defined in the golden path.
And at last, we wanted this process to be automated with minimal human toil — with infrastructure as code used everywhere. We've shown how this was the case from the PR account request all the way to the integration between Terraform Enterprise and our VCS.
So having checked our requirements, it's time now to draw some conclusions on what we've seen. First, we can say that, once released, the AWS Terraform Landing Zone will be a valid solution for setting up a multi-AWS account infrastructure. But it is not, and as far as we know it will never be, a plug-and-play product; rather, it is an accelerator that requires lifelong development and adaptation, as we've partially seen today.
We cannot underline enough how important it is to define a golden path for your engineering teams, so they can fully leverage the flexibility and tremendous speed of growth that a solution like the one we showed you today offers, without running into common pitfalls.
I guess that a question that everybody wants to ask us now is, was it worth it? At Eventbrite, we use the AWS Terraform Landing Zone to power our new domain infrastructure via a self-service account vending machine. We can say now that it was totally worth it to adopt this new solution. Without it, we simply could not have achieved our multi-AWS account strategy while meeting all of our requirements.
Thank you so much for joining our talk. We sincerely hope you've enjoyed it and have learned about a new solution that has changed the way we work at Eventbrite. Ciao. Ciao. And have a good HashiConf.