In the beginning
Pioneering a revolution does not happen by doing things the way they have always been done, but by thinking outside the box and discovering new approaches. From designing cutting-edge custom hardware platforms to writing highly optimized software, Cruise is finding the newest and best tools to do the job. Its fully integrated hardware and software stack has made Cruise an industry leader and the first to pilot a fleet of driverless cars in one of the world’s most demanding urban environments: San Francisco, California.
Teams within Cruise were looking for a better way to work together. The Application Development Teams were responsible for building the applications that ultimately make up the autonomous vehicle technology, while the Infrastructure Team, a smaller group of Site Reliability Engineers (SREs), was responsible for providing the infrastructure and resources for those applications.
As they began to institute core workflows, they needed a way to easily onboard more operations engineers, SREs, and developers. Their goal was to establish a collaborative approach to infrastructure provisioning that would allow both teams to innovate, validate, iterate, and repeat as safely and efficiently as possible.
From its earliest days, the Cruise infrastructure team, initially just one person, used HashiCorp Terraform open source (OSS) to provision cloud resources using the HashiCorp Configuration Language (HCL). To facilitate code sharing, all configuration code was kept in a single repository, and only a small number of engineers had access to deploy configuration changes.
With success comes growth. After adopting Terraform OSS, Cruise’s infrastructure team expanded to more than 40 SREs, while the development organization grew to more than 800 software engineers.
At this point, the infrastructure team realized it had outgrown the workflow it had put in place with Terraform OSS:
- Managing a large mono-repository had become a challenge and hurt productivity
- Manual coordination of changes had become complex and slowed down provisioning
- An error introduced by a single developer could impact many others, which elevated risk
- It was difficult to ensure guardrails were in place each time a developer provisioned infrastructure
An approach that scales with the organization
To keep pace with the growth in teams, applications, and software throughout Cruise, engineers needed a way to scale up infrastructure quickly and sustainably. Terraform has been a pivotal technology in meeting these needs. With Terraform Enterprise, SREs are able to put Terraform in the hands of developers and increase efficiency.
Increase Productivity for the Entire Organization
Using workspaces, the team was able to decompose the mono-repository into separate, well-defined micro-repositories and segment the usage and execution of the code by workspace. Each user has access to their own workspace, where they can write infrastructure as code and execute runs to create infrastructure. Because each workspace is isolated, an error in one workspace does not affect infrastructure created and managed outside it. The organization also uses modules, which are codified, reusable components of infrastructure. The modules are approved by the SRE team and made available for application developers to use within their individual workspaces.
Reduce Risk with Sentinel Policy and Governance
With hundreds of individuals provisioning infrastructure, rules were needed to prevent inadvertent mistakes that would compromise security and operational best practices. To do this, the core infrastructure team uses Sentinel, HashiCorp’s policy as code framework, to codify policies in version control and enforce them before infrastructure is provisioned. This approach helps them automate policy checks such as:
- Are the resources being provisioned approved?
- Do AWS S3 buckets have logging enabled for audit?
- Does this role have permission to delete resources?
- Is the resource being created in the same region as the VPC subnet?
Sentinel, along with modules, allows the SRE team to ensure that any infrastructure being provisioned is done in a way that is secure and follows operational best practices.
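To make the idea of policy as code concrete, the checks listed above can be sketched in a few lines. This is an illustrative Python stand-in, not Sentinel syntax and not Cruise's actual policies: the `PlannedResource` type, the `APPROVED_TYPES` allow-list, the `logging_enabled` attribute, and the `check_plan` function are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical allow-list; in practice this would be maintained by the SRE team.
APPROVED_TYPES = {"aws_s3_bucket", "aws_instance"}


@dataclass
class PlannedResource:
    """Simplified stand-in for one resource in a provisioning plan."""
    address: str
    resource_type: str
    region: str
    attributes: dict = field(default_factory=dict)


def check_plan(resources, subnet_region):
    """Return a list of policy violations; an empty list means the plan passes."""
    violations = []
    for r in resources:
        # Are the resources being provisioned approved?
        if r.resource_type not in APPROVED_TYPES:
            violations.append(f"{r.address}: type '{r.resource_type}' is not approved")
        # Do S3 buckets have logging enabled for audit?
        if r.resource_type == "aws_s3_bucket" and not r.attributes.get("logging_enabled", False):
            violations.append(f"{r.address}: S3 bucket must have logging enabled")
        # Is the resource being created in the same region as the VPC subnet?
        if r.region != subnet_region:
            violations.append(
                f"{r.address}: region '{r.region}' does not match subnet region '{subnet_region}'"
            )
    return violations


plan = [
    PlannedResource("aws_s3_bucket.logs", "aws_s3_bucket", "us-west-2", {"logging_enabled": True}),
    PlannedResource("aws_s3_bucket.data", "aws_s3_bucket", "us-west-2"),
    PlannedResource("aws_instance.web", "aws_instance", "us-east-1"),
]

violations = check_plan(plan, subnet_region="us-west-2")
for message in violations:
    print(message)
```

The key design point is that the checks run against the plan before anything is created, so a failing plan can be rejected without touching real infrastructure.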
Increase Developer Agility
In the Cruise organization, the SREs are the infrastructure experts and create a baseline standard for how infrastructure can be provisioned and consumed. The team uses modules to provide a library of 150 versioned, validated, and approved resources that developers can use to provision infrastructure in Terraform Enterprise. If a module does not exist, developers can create one and submit it for validation and approval by the SRE team. The module repository started as a small set of approved resource modules and has grown over time to include common resources for the majority of the development organization. This process was then built into an automated pipeline using a webhook to integrate with GitHub.
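The module workflow described above can be sketched as a simple lookup against an approved catalog. This is a minimal Python illustration under stated assumptions: the `APPROVED_MODULES` catalog, the module names and versions in it, and the `review_request` function are hypothetical and do not represent Cruise's tooling or any Terraform Enterprise API.

```python
# Hypothetical catalog: module name -> set of versions approved by the SRE team.
APPROVED_MODULES = {
    "network/vpc": {"1.0.0", "1.1.0"},
    "storage/bucket": {"2.0.0"},
}


def review_request(name, version):
    """Decide what happens when a developer asks to use a module version."""
    if name not in APPROVED_MODULES:
        # No such module yet: the developer writes one and submits it for approval.
        return "submit new module for SRE review"
    if version not in APPROVED_MODULES[name]:
        # The module exists, but this version has not been validated.
        return "use an approved version"
    return "approved"


print(review_request("network/vpc", "1.1.0"))
print(review_request("network/vpc", "9.9.9"))
print(review_request("queue/topic", "1.0.0"))
```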
- Increased efficiency and productivity throughout the organization
- Mitigated risk of inadvertent errors and improved developer productivity
- Automated infrastructure provisioning for hundreds of developers
As the teams at Cruise were rapidly building autonomous vehicle software to run on self-driving cars, they needed to establish an infrastructure automation pipeline to scale efficiently and support the growth they were experiencing. The Terraform Enterprise workflow has enabled the Cruise team to grow while continuing to maximize operational efficiency and ensure proper guardrails are in place to minimize risk.
Mark Sparhawk, Site Reliability Engineering Manager, Cruise
Mark Sparhawk is a site reliability engineering manager at Cruise and governs all aspects of the Autonomous Vehicle program. Mark is passionate about building high-performing teams with an emphasis on diversity, inclusion, and belonging. He is responsible for multiple embedded SRE teams, the centralized tooling, library, and services team, and the capacity and efficiency teams.