Drift detection is essential to maintaining efficiency while minimizing risk and giving teams confidence in your infrastructure as code (IaC) provisioning workflow.
As HashiCorp’s CISO, I understand how infrastructure as code (IaC) practices enhance security and data governance compliance. But even with a standardized IaC workflow in place, including guardrails such as policy as code, IaC isn’t a panacea.
In the real world, infrastructure will continue to be changed and updated in response to an organization’s goals and unforeseen events. It’s hard to eliminate every instance of infrastructure state modification that isn’t tracked in your source of truth. Many organizations end up with state files that don’t match the actual running infrastructure, a phenomenon known as “drift.” This can dramatically undercut the reliability and security benefits of IaC.
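To make the idea concrete, here is a minimal Python sketch of drift as a difference between recorded and actual attributes. The resource, attribute names, and the `find_drift` helper are hypothetical illustrations, not part of any IaC tool.

```python
from typing import Any


def find_drift(recorded: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return attributes whose recorded and actual values differ.

    Each entry maps an attribute name to (recorded_value, actual_value).
    An attribute missing on one side is reported as None on that side.
    """
    drift = {}
    for key in sorted(recorded.keys() | actual.keys()):
        before, after = recorded.get(key), actual.get(key)
        if before != after:
            drift[key] = (before, after)
    return drift


# Hypothetical example: a storage bucket whose access setting was changed by hand.
recorded_state = {"acl": "private", "versioning": True}
actual_state = {"acl": "public-read", "versioning": True, "logging": False}

print(find_drift(recorded_state, actual_state))
# prints {'acl': ('private', 'public-read'), 'logging': (None, False)}
```

Real tools compare far richer state, but the principle is the same: anything that differs between the source of truth and the running system is drift.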
To maximize the value of infrastructure as code, it’s important to understand the causes of infrastructure drift, what the impact can be — especially on security — and the best ways to implement drift detection and remediation to help solve the problem.
Drift can occur for many reasons. First off, there may be cases where not everyone in the organization is using the established IaC workflows. That can create unrecorded differences between the infrastructure defined in code and the actual current state.
Emergencies are another common cause. In the midst of a “break-glass incident,” response management teams sometimes decide to bypass standard procedures for patching the infrastructure in order to fix the problem as quickly as possible. These kinds of shortcuts can cause changes to the resources that are tough to track and resolve in the code.
In addition, routine updates to cloud or service-provider systems can accrue over time, resulting in significant drift as your infrastructure rules and provider systems gradually grow apart. For example, simple API changes (often in third-party services) might affect your infrastructure without being tracked in code.
Finally, cascading effects can make drift detection even more complex. When teams change or create infrastructure resources, for example, associated resources that aren’t codified may be created or modified as a side effect. Those resources can then affect one another’s state without anyone being aware of it.
As cloud adoption grows, organizational resources and processes become increasingly complex, which can create inconsistencies around the state of the infrastructure. Without standard procedures, notifications, or guidelines for adjustments, even temporary changes or the smallest tweaks to infrastructure can have significant impacts on the business, including unplanned downtime, audit findings, security incidents, rework, and unused resources.
Most importantly, unrecognized infrastructure drift creates multiple security risks that need to be addressed before they become real problems. Drift can dramatically increase the probability of critical data exposures, perhaps due to mission-critical systems left open to public access by mistake or other unknown resources being left unsecured.
Additionally, development teams unaware of production environment changes not reflected in the IaC systems will almost certainly have to contend with applications “suddenly” crashing and deployment projects that unexpectedly fail.
So, how can organizations best handle drift detection, and what can they do to remediate the situation when drift is detected? Some companies opt to build in-house tooling that checks all states for drift at once and then sends reports via email to all users. But this makes it difficult to differentiate necessary changes from unneeded ones, since there’s no context behind the changes. Plus, it’s up to you to make the manual changes to the resource or the recorded IaC state. This approach is too time-consuming to be scalable.
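For a sense of what such in-house tooling often looks like, here is a rough Python sketch that wraps `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The `check_workspace` helper and its status labels are assumptions made for illustration. Note that exit code 2 only says the configuration and the real infrastructure differ; it cannot tell drift apart from configuration changes that simply haven’t been applied yet, which is part of the missing-context problem described above.

```python
import subprocess
from typing import Callable

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "changes_pending"}


def check_workspace(
    workdir: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Run a plan in `workdir` and translate the exit code to a status label.

    `run` is injectable so the function can be exercised without Terraform
    installed (for example, in unit tests).
    """
    cmd = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]
    result = run(cmd, cwd=workdir, capture_output=True, text=True)
    # Treat any unexpected exit code as an error rather than guessing.
    return PLAN_STATUS.get(result.returncode, "error")
```

Calling `check_workspace` on each initialized working directory from a scheduled job yields a per-workspace status, but someone still has to interpret every `changes_pending` result by hand.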
The underlying solution to these challenges comes down to answering two key questions:
Ultimately, teams concerned with drift should look for integrated drift-detection solutions. Ideally, this type of system would include all-in-one automated provisioning and central management so development teams can continuously monitor the infrastructure state to detect changes. Operating from a consolidated environment, the system should be able to send immediate notifications to the appropriate teams so they can take specific corrective actions any time a resource is altered.
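As a small illustration of routing notifications to the appropriate teams rather than emailing everyone, the Python sketch below groups drift events by an ownership map. The workspace names, channel names, and the `DriftEvent` and `route` helpers are all hypothetical; an integrated product would handle this for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftEvent:
    workspace: str   # where the drift was detected
    resource: str    # which resource changed
    attribute: str   # which attribute no longer matches the recorded state


# Hypothetical ownership map: workspace name -> notification channel.
OWNERS = {"payments-prod": "payments-oncall", "network-core": "platform-team"}
DEFAULT_CHANNEL = "infra-triage"  # fallback for workspaces with no listed owner


def route(events: list[DriftEvent]) -> dict[str, list[DriftEvent]]:
    """Group drift events by the channel that should be notified."""
    grouped: dict[str, list[DriftEvent]] = {}
    for event in events:
        channel = OWNERS.get(event.workspace, DEFAULT_CHANNEL)
        grouped.setdefault(channel, []).append(event)
    return grouped
```

Each team then sees only the drift in the workspaces it owns, which keeps the context needed to decide whether a change should be reverted or codified.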
For CISOs concerned with narrowing security gaps — both the kind they know about and the previously undetectable ones created by infrastructure drift — this type of solution can help strengthen the organization’s overall security posture without adding undue operational burdens.
Specifically, an integrated drift-detection approach can significantly reduce the potential for application downtime that could negatively impact user experience and, eventually, revenue. It can also empower teams to track and quickly address system changes, identify who made them and why, and record those changes for future reference or to adjust the standard workflow as needed.
Finally, a robust drift-detection system can boost operational agility by giving teams a consistent single source of truth from which they can collaborate. Working from the same information avoids the need to buy or develop custom tooling or deal with manual actions to refresh the state — all while granting superior visibility and accelerating time to resolution.
To recap, automated infrastructure provisioning offers significant productivity and security benefits. But what about when your infrastructure changes and the actual state isn’t reflected in the recorded IaC state? Drift is an unfortunate side effect of modern, dynamic infrastructure, where changes are made constantly.
To minimize the impact of infrastructure drift, you need a drift-detection system that gives your operations teams visibility and alerts the appropriate people to take action when needed. Working together systematically under a standardized process with centralized, automated tools promises to reduce risk, deliver greater system visibility, and give teams the ability to resolve infrastructure issues more quickly.
HashiCorp Terraform provides built-in infrastructure automation, with workflows to build, compose, collaborate on, and reuse infrastructure as code, and it includes drift detection features. See Melar Chen’s blog post “Drift Detection for Terraform Cloud is Now Generally Available” for more information, or try Terraform Cloud for free to provision, change, and version infrastructure resources in any environment.
A version of this blog post was originally published on The New Stack.