White Paper

Monitoring the Cloud Operating Model with Datadog

Accelerating Multi-cloud Adoption through Observability and the HashiCorp Stack

Executive Summary

To thrive in an era of multi-cloud architecture driven by digital transformation, enterprise IT must evolve from ITIL-based gatekeeping to enabling shared self-service processes for DevOps excellence.

For most enterprises, the goal of digital transformation is to deliver new business and customer value more quickly at a very large scale. The implication for enterprise IT, then, is a shift from cost optimization to speed optimization. The cloud is a core part of this shift as it offers the capacity to rapidly deploy on-demand services with limitless scale. 

Enterprises must consider how to industrialize the application delivery process across each layer of cloud infrastructure — provisioning, security, networking, and run time — to unlock the fastest path to value. This process includes embracing the cloud operating model and tuning people, processes, and tools to it. HashiCorp's suite of tools supports this transition by helping enterprises provision, secure, connect, and run their applications and infrastructure in the cloud.

But as enterprises shift to the cloud operating model, they also must consider how to monitor dynamic cloud ecosystems. Datadog helps enterprises add observability to their environments by unifying telemetry data from across their technology stack. This helps IT teams monitor the health and performance of each layer of their infrastructure, get fast feedback on changes, and eliminate performance blind spots as they transition to the cloud.

In this white paper, we look at how Datadog and HashiCorp tools work together to help enterprises align on a clear strategy for not only cloud adoption, implementation, and usage but also observability into their infrastructure, security, networking, and application deployments.

Transitioning to a Multi-Cloud Datacenter

The move to cloud and multi-cloud environments is a generational transition for IT, as it requires shifting from largely dedicated servers in a private datacenter to a pool of compute capacity available on demand. While most enterprises begin with one cloud provider, there are good reasons to use services from multiple providers. Most Global 2000 organizations will inevitably use more than one, either by design or through mergers and acquisitions.

The cloud presents an opportunity to create better speed and scale optimization practices for new “systems of engagement,” which are the applications built to engage customers and users. These new apps are the primary interface for the customer to engage with a business, and are ideally suited for delivery via the cloud as they tend to:

  • Have dynamic usage characteristics and need to quickly scale loads up and down by orders of magnitude.

  • Need to be built and iterated on quickly under competitive pressure. Many of these new systems may be ephemeral in nature, delivering a specific user experience around an event or campaign.

For most enterprises, though, these systems of engagement must connect to existing “systems of record,” the core business databases and internal applications that often continue to reside on infrastructure in existing datacenters. As a result, enterprises end up with a hybrid — a mix of multiple public and private cloud environments.

The challenge is how to deliver these applications to the cloud with consistency while also ensuring the least possible friction across the various development teams. To reduce friction, the transition to the cloud requires facilitating observability in three areas:

  1. Monitoring ephemeral environments at scale: Teams need to keep pace with the rate of change in a dynamic cloud environment. To accomplish this, they need observability tools that can auto-scale with their environments and provide real-time data for monitoring the performance of ephemeral cloud resources as soon as they spin up.

  2. Monitoring complex infrastructure: Cloud infrastructure can be complex, utilizing resources from various cloud providers, platforms, and technology stacks. Teams need to visualize the connections between all of these resources so they can efficiently diagnose performance problems.

  3. Monitoring for security and compliance: For most modern applications, teams create security and compliance policies to ensure that sensitive data is safe. Enforcing these policies requires knowing when systems become vulnerable, so teams need to be able to monitor all service activity to detect potential threats and be aware of any compliance issues with new or modified cloud resources.

The Shift from Static to Dynamic

The essential implication of the transition to the cloud is the shift from “static” infrastructure to “dynamic” infrastructure: from a focus on configuration and management of a static fleet of IT resources to provisioning, securing, connecting, and running dynamic resources on demand.

Let’s look at how this transition affects each layer of the stack:

  • Provision. The infrastructure layer transitions from running dedicated servers at limited scale to a dynamic environment where organizations can easily adjust to increased demand by spinning up thousands of servers and then scaling them down when not in use. As architectures and services become more distributed, the sheer volume of compute nodes increases significantly.

  • Secure. The security layer transitions from a fundamentally “high-trust” world enforced by a strong perimeter and firewall to a “low-trust” or “zero-trust” environment with no clear or static perimeter. As a result, the foundational assumption for security shifts from being IP-based to identity-based access to resources. This shift is highly disruptive to traditional security models.

  • Connect. The networking layer moves from being heavily dependent on the physical location and IP address of services and applications to using a dynamic registry of services for discovery, segmentation, and composition. Enterprise IT teams do not have the same control over the network, or the physical locations of compute resources, and must think about service-based connectivity.

  • Run. The runtime layer shifts from deploying artifacts on a static application server to deploying applications with a scheduler atop a pool of infrastructure provisioned on-demand. In addition, new applications become collections of services that are dynamically provisioned and packaged in multiple ways: from virtual machines to containers.

To ensure a smooth transition to the cloud and create an effective plan for monitoring cloud resources, teams should consider the following observability goals:

  • Collect the right data: Cloud resources generate a wealth of data for identifying and investigating problems. Knowing which kinds of data to collect, such as metrics and events, gives teams more complete visibility into their systems. This enables them to create meaningful alerts and quickly investigate performance issues in cloud environments.

  • Alert on what matters: Automated alerts draw attention to service degradations and disruption, enabling teams to quickly respond to an issue before it becomes more serious. However, not all alerts are useful or carry the same level of urgency. High-severity alerts can be used as direct pages while low-severity ones are better suited as records of activity. Because of this, teams need to look at what types of notifications are most important for their environments.

  • Investigate performance issues: Once an alert is triggered for an issue that requires attention, teams can use the monitoring data they have collected to swiftly diagnose the root cause. They can start their investigations by first looking at metrics and associated events from their highest-level systems then drilling down to other affected layers of their environments.

By considering these goals, teams can have visibility into the health and performance of their systems at any stage of the transition to the cloud operating model. Datadog brings together all of an environment's metrics, traces, logs, and other telemetry, giving teams a single source of truth for visualizing the connections between services, collaborating on real-time data, and investigating issues across infrastructure, security, networking, and applications. 

Monitoring the Cloud Operating Model with Datadog

The implications of the cloud operating model affect teams across operations, security, networking, and development. To successfully deliver the dynamic infrastructure necessary for each layer, enterprises need a system of shared services for their teams. This includes leveraging observability platforms like Datadog to consolidate the separate systems teams use to monitor their applications and underlying IT infrastructure.

Provision: HashiCorp Terraform and Datadog

Provisioning infrastructure is a core element of adopting the cloud, as it offers teams a reliable mechanism for deploying and managing their resources. HashiCorp Terraform is one of the world’s most widely used cloud provisioning products and can be employed to deploy and manage infrastructure on any cloud platform while automatically enforcing compliance and governance policies via policy-as-code frameworks such as HashiCorp Sentinel.

IT teams can create Terraform templates to configure services used on one or more cloud platforms. Terraform integrates with all major configuration-management tools to allow fine-grained control over an infrastructure's underlying resources. Finally, templates can be extended with services from many other independent software vendors to include monitoring agents, application performance monitoring (APM) systems, security tooling, DNS, content delivery networks, and more. Once defined, the templates can be used to establish a repeatable process for deploying cloud resources.
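As an illustration, the following is a minimal sketch of a Terraform template that provisions a single cloud instance. The AWS provider, region, AMI ID, and instance type shown here are placeholder assumptions, not prescriptions:

    terraform {
      required_providers {
        aws = {
          source = "hashicorp/aws"
        }
      }
    }

    provider "aws" {
      region = "us-east-1" # placeholder region
    }

    # A single web server instance; the AMI ID and instance type are
    # hypothetical values to be replaced with your own.
    resource "aws_instance" "web" {
      ami           = "ami-0abcdef1234567890"
      instance_type = "t3.micro"

      tags = {
        Name = "web-server"
        env  = "prod"
      }
    }

Because the template fully describes the instance, running it repeatedly yields the same resource, which is what makes the deployment process repeatable.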

Reproducible Monitoring as Code

HashiCorp and Datadog have partnered to develop the HashiCorp Terraform Verified Provider for Datadog, so teams can leverage Datadog's extensive APIs from within their templates and add monitoring as code to their provisioning workflows. Teams can deploy any Datadog resource alongside new or existing infrastructure, which significantly reduces the gaps in visibility between services. With the Datadog provider, teams can use Terraform to (see the sketch following this list):

  • Deploy monitors and dashboards for new and existing resources automatically

  • Set up integrations for cloud providers like Amazon Web Services, Microsoft Azure, and Google Cloud

  • Create new synthetic tests to verify application behavior in new environments

  • Create service-level objectives for newly deployed applications
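The sketch below shows the Datadog provider deploying a monitor as code; the metric query, thresholds, and notification handle are illustrative assumptions:

    terraform {
      required_providers {
        datadog = {
          source = "DataDog/datadog"
        }
      }
    }

    provider "datadog" {
      api_key = var.datadog_api_key
      app_key = var.datadog_app_key
    }

    # A metric monitor provisioned alongside the infrastructure it
    # watches; the query, thresholds, and @-handle are placeholders.
    resource "datadog_monitor" "high_cpu" {
      name    = "High CPU on {{host.name}}"
      type    = "metric alert"
      query   = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 90"
      message = "CPU is above 90% for 5 minutes. Notify: @ops-team"

      monitor_thresholds {
        critical = 90
        warning  = 80
      }

      tags = ["env:prod", "managed-by:terraform"]
    }

Because the monitor lives in the same template as the infrastructure it watches, new resources arrive with their monitoring already in place.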

Datadog can also visualize all resources managed by Terraform. The Terraform Cloud Business tier provides access to an Audit Trails API, which exposes a stream of audit events describing changes to the application entities (workspaces, runs, etc.) within a Terraform Cloud organization. Practitioners can import these audit logs into Datadog to gain greater visibility into the details of their operations. By tying observability to infrastructure, Terraform and Datadog become the standard for provisioning application resources and monitoring their performance.

Secure: HashiCorp Vault and Datadog

The first step in cloud security is typically secrets management: the central storage, access control, and distribution of dynamic secrets. Instead of depending on static IP addresses, teams must integrate with identity-based access systems such as AWS IAM and Azure Active Directory to authenticate to and access services and resources.

HashiCorp Vault uses policies to codify how applications authenticate, which credentials they are authorized to use, and how auditing should be performed. It can integrate with an array of trusted identity providers such as cloud identity and access management (IAM) platforms, Kubernetes, Active Directory, and other SAML-based systems for authentication. Vault then centrally manages and enforces access to secrets and systems based on trusted sources of application and user identity. 
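Vault policies are themselves expressed as code. The minimal sketch below assumes a KV version 2 secrets engine mounted at secret/ and a hypothetical application named billing-app:

    # Allow the application to read only its own secrets
    path "secret/data/billing-app/*" {
      capabilities = ["read"]
    }

    # Vault denies everything else by default; this rule only makes
    # the intent explicit for reviewers of the policy
    path "secret/data/*" {
      capabilities = ["deny"]
    }

Binding a policy like this to a trusted identity (for example, a Kubernetes service account) is what lets Vault enforce least-privilege access centrally.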

In addition, Vault helps protect data both at rest and in transit, and exposes a high-level cryptography API that lets developers secure sensitive data without exposing encryption keys. Vault can also act as a certificate authority, providing dynamic, short-lived certificates to secure communications with SSL/TLS. Lastly, Vault enables a brokering of identity between different platforms, such as AWS IAM or on-premises Active Directory, to allow applications to work across platform boundaries.

Ensure Healthy Vault Clusters

Using Vault as a basis for encryption-as-a-service and secrets management requires visibility into the state and performance of Vault clusters. Without this visibility, teams may overlook issues that affect the performance of their clusters and dependent services. For example, teams need to be aware of high leadership turnover in a Vault cluster before any services that leverage secrets to communicate with downstream clients become unstable.

Datadog provides full visibility into Vault cluster health and performance by collecting key metrics and logs from Vault servers. This enables teams to readily detect potential security issues, including high leadership turnover, and to track long-term cluster performance trends. Teams can also use this data to create a variety of automated alerts for cluster performance issues, including forecasting alerts to account for periodic fluctuations in cluster metrics.
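One common way to feed this data to Datadog is Vault's telemetry stanza, which can ship runtime metrics to the DogStatsD endpoint of a locally running Datadog Agent. The address and tags below are assumptions to adapt to your deployment:

    # Vault server configuration fragment: emit runtime metrics to a
    # Datadog Agent's DogStatsD listener (address is a placeholder)
    telemetry {
      dogstatsd_addr = "127.0.0.1:8125"
      dogstatsd_tags = ["env:prod", "service:vault"]
    }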

Connect: HashiCorp Consul and Datadog

Networking in the cloud is often the most difficult challenge for enterprises adopting the cloud operating model. The combination of dynamic IP addresses, significant growth in inbound and outbound traffic, and the lack of a clear network perimeter presents a formidable problem. To solve it, HashiCorp Consul provides a multi-cloud service networking layer to discover and securely connect distributed services.

Consul creates a single source of truth for all registered services across multiple datacenters, clouds, and runtime platforms. It provides consistent service networking capabilities for both more traditional networking infrastructure, such as load balancers or firewalls, and distributed application platforms like Kubernetes and HashiCorp Nomad. Consul also provides an enterprise-ready service mesh that pushes routing, authorization, and other networking functionalities to the endpoints in the network, rather than imposing them through middleware. 
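Each entry in that source of truth starts with a service definition registered with the local Consul agent, as in the sketch below; the service name, port, and health check endpoint are hypothetical:

    # Consul agent service definition: registers a "web" service with
    # an HTTP health check (names and values are placeholders)
    service {
      name = "web"
      port = 8080

      check {
        http     = "http://localhost:8080/health"
        interval = "10s"
        timeout  = "2s"
      }
    }

Once registered, the service is discoverable across the cluster, and its health check feeds the catalog that load balancers and the service mesh rely on.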

Maintain Stable Consul Clusters

Because Consul manages the network and configuration details that distributed services rely on for communication, monitoring the health of Consul clusters is key to ensuring those services continue performing as expected. Datadog's built-in Consul integration collects Consul-generated metrics and logs, giving teams greater visibility into cluster health so that they can prevent outages before they occur.

Datadog provides a real-time view of the state of a Consul cluster with host maps. Teams can group hosts by tags so they can determine if a performance issue is affecting individual nodes or an entire cluster. Teams can also monitor key cluster metrics and create alerts for critical issues that can affect Consul's overall stability, such as frequent leadership transitions.

Monitor Service Performance

Implementing a service discovery solution creates the foundation for improved analytics and performance monitoring. Beyond tracking the health of Consul clusters, Consul and Datadog can integrate to collect service-level data such as error rates, requests per second, total connections, and more. Using this data, teams can gain greater insights into how their services are running and identify specific areas of improvement.
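As with Vault, one way to get this data flowing is the Consul agent's telemetry stanza, which can forward metrics to a Datadog Agent over DogStatsD; the address below is an assumption:

    # Consul agent configuration fragment: forward agent and service
    # metrics to DogStatsD (address is a placeholder)
    telemetry {
      dogstatsd_addr = "127.0.0.1:8125"
    }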

As organizations move towards a service mesh architecture, teams can implement distributed tracing within their applications to monitor the path of requests as they cross service and process boundaries. In distributed systems, individual requests may travel through multiple services (e.g., sidecar proxies, APIs, etc.) before resolving. Capturing performance data at each service endpoint enables organizations to identify bottlenecks so they can reroute traffic to healthy endpoints. Consul and Datadog make it possible to both capture and visualize this information, enabling greater observability into network performance.  

Run: HashiCorp Nomad and Datadog

Finally, at the application layer, modern apps are increasingly distributed while legacy apps also need to be managed more flexibly. HashiCorp Nomad provides a simple and flexible orchestrator to deploy, schedule, and manage legacy and modern applications for all types of workloads, including long-running services, short-lived batches, and system agents. 

Nomad is also multi-region and multi-cloud by design, with a consistent workflow for deploying any application. As teams roll out global applications in multiple datacenters or across cloud boundaries, Nomad provides orchestration and scheduling for those applications, supported by the infrastructure, security, and networking resources and policies to ensure successful deployment.
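A Nomad job spec illustrates this consistent workflow. The sketch below assumes the Docker driver and a hypothetical web service image:

    # A minimal Nomad job: one group running a Dockerized web service
    # (datacenter, image, and resource figures are placeholder values)
    job "web" {
      datacenters = ["dc1"]
      type        = "service"

      group "frontend" {
        count = 2

        task "server" {
          driver = "docker"

          config {
            image = "example/web:1.0" # hypothetical image
          }

          resources {
            cpu    = 500 # MHz
            memory = 256 # MB
          }
        }
      }
    }

The same job file can be submitted to any Nomad region, which is what makes the deployment workflow portable across datacenters and clouds.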

Monitor Nomad Cluster Performance and Availability

Teams can use Datadog's Nomad integration to capture key metrics from Nomad clusters, so they can monitor performance indicators such as cluster capacity, job status, and memory pressure. Since Nomad clusters share resources to run a variety of workloads, monitoring capacity and other performance indicators helps ensure that clusters have enough resources to run all of them optimally. 
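To surface these indicators, the Nomad agent's telemetry stanza can publish node and allocation metrics straight to a local Datadog Agent; the address below is an assumption:

    # Nomad agent configuration fragment: publish cluster metrics to a
    # local Datadog Agent (address is a placeholder)
    telemetry {
      publish_allocation_metrics = true
      publish_node_metrics       = true
      datadog_address            = "127.0.0.1:8125"
    }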

Teams can also leverage tags to track the performance of specific workloads, such as those that execute long-running processes. And with Datadog's alerting capabilities, teams can be automatically notified as soon as key cluster metrics, such as CPU utilization or memory usage, reach specified thresholds. This enables teams to address performance issues for critical workloads and maintain cluster stability.

Autoscale Nomad Workloads and Clusters to Meet Real-Time Demands 

Nomad provides an autoscaler for scaling workloads and clusters horizontally in order to meet real-time demand. Teams can leverage the autoscaler's Datadog APM plugin in their autoscaling policies to define when to scale resources based on specific metrics captured by Datadog, such as an underlying host's CPU and memory utilization. By using Datadog to help make scaling decisions, teams can ensure they always have sufficient resources to support their applications.
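As a sketch, a scaling policy attached to a task group can source its signal from Datadog; the metric query, bounds, and target below are illustrative assumptions, not recommended values:

    # Nomad scaling policy fragment (placed inside a job's task group):
    # scale between 2 and 10 instances, targeting 70% average CPU as
    # reported by Datadog (the query and target are placeholders)
    scaling {
      enabled = true
      min     = 2
      max     = 10

      policy {
        check "avg_cpu" {
          source = "datadog"
          query  = "avg:nomad.client.allocs.cpu.total_percent{task_group:frontend}"

          strategy "target-value" {
            target = 70
          }
        }
      }
    }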

Conclusion

Adopting a common cloud operating model is critical for enterprises aiming to maximize their digital transformation efforts. The HashiCorp suite of tools provides solutions for each layer of the cloud to help enterprises make this shift to the cloud operating model.

Enterprise IT needs to evolve away from ITIL-based control points — focused on cost optimization — toward becoming self-service enablers focused on speed optimization. This means providing shared services across all four layers of the cloud infrastructure so teams can deliver new business and customer value at high speed.

Observability is an essential shared service across runtime, networking, security, and infrastructure. Datadog and HashiCorp are working together to promote a smooth transition to the cloud by equipping teams with the appropriate suite of tools for deploying to, securing, and monitoring in the cloud. 

Datadog supports this transition by establishing an enterprise-wide monitoring standard, delivering end-to-end visibility across each layer of cloud applications by unifying telemetry data from HashiCorp Terraform, Vault, Consul, and Nomad. This enables enterprises to collaborate around a single source of truth, using real-time data to visualize the connections between services and components in the cloud and identify the source of performance issues before they significantly affect users.
