Running Apache Spark on HashiCorp Nomad

Jul 25 2017 Rob Genova

Apache Spark is a popular data processing engine/framework that has been architected to use third-party schedulers. The schedulers that are available, however, involve a level of complexity that can be undesirable for many potential Spark users. To help fill this gap, we are pleased to announce that the HashiCorp Nomad ecosystem now includes a version of Apache Spark that natively integrates Nomad as a Spark cluster manager and scheduler.

Why Spark on Nomad?

Nomad's design (inspired by Google's Borg and Omega) has enabled a set of features that make it well suited to running analytical applications. Particularly relevant are its native support for batch workloads and its parallelized, high-throughput scheduling (more on Nomad’s scheduler internals here). Nomad is also easy to set up and use, which can ease the learning curve and operational burden for Spark users. Key ease-of-use-related features include:

Nomad also integrates seamlessly with HashiCorp Consul and HashiCorp Vault for service discovery, runtime configuration, and secrets management.

group "executor-group-name" {
  task "executor-task-name" {
    meta {
      "spark.nomad.role" = "executor"
    }

    env {
      "BAZ" = "something"
    }
  }
}

Job templates can be used to add metadata or constraints, set environment variables, add sidecar tasks, and utilize the Consul and Vault integrations.
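As a rough illustration of how a template might combine several of these customizations, the sketch below adds job-level metadata, a placement constraint, and an environment variable for executors. It is an assumption-laden example rather than a reference configuration: the enclosing job "template" wrapper, the "team" metadata key, the "analytics" node class, and the group and task names are all hypothetical placeholders. Only the "spark.nomad.role" metadata key and the "BAZ" variable come from the snippet above.

```hcl
# Hypothetical job template; all names and values are illustrative only.
job "template" {
  # Job-level metadata (assumed key and value).
  meta {
    "team" = "data-eng"
  }

  group "executor-group-name" {
    # Constraint: place executors only on nodes whose class is
    # "analytics" (an assumed node class, not one defined by Nomad).
    constraint {
      attribute = "${node.class}"
      value     = "analytics"
    }

    task "executor-task-name" {
      # Marks this task as the one Spark should use for executors.
      meta {
        "spark.nomad.role" = "executor"
      }

      # Environment variables passed to each executor task.
      env {
        "BAZ" = "something"
      }
    }
  }
}
```

Keeping placement rules and environment settings in a template like this keeps them in one file rather than spread across individual job submissions. Check the exact template structure the integration expects against the integration guide.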

The Nomad/Spark integration also supports fine-grained resource allocation, HDFS, and continuous monitoring of application output.

Getting Started

Our official Apache Spark Integration Guide is the best way to get started. You can also use Nomad's example Terraform configuration and embedded Spark quickstart.
