Running Apache Spark on HashiCorp Nomad

Jul 25 2017    Rob Genova

Apache Spark is a popular data processing engine/framework that has been architected to use third-party schedulers. The schedulers that are available, however, involve a level of complexity that can be undesirable for many potential Spark users. To help fill this gap, we are pleased to announce that the HashiCorp Nomad ecosystem now includes a version of Apache Spark that natively integrates Nomad as a Spark cluster manager and scheduler.

Why Spark on Nomad?

Nomad's design (inspired by Google's Borg and Omega) has enabled a set of features that make it well suited to running analytical applications. Particularly relevant are its native support for batch workloads and its parallelized, high-throughput scheduling (more on Nomad's scheduler internals here). Nomad is also easy to set up and use, which can ease the learning curve and operational burden for Spark users. Key ease-of-use related features include:

Nomad also integrates seamlessly with HashiCorp Consul and HashiCorp Vault for service discovery, runtime configuration, and secrets management.

How it Works

When running on Nomad, the Spark executors that run tasks for your application, and (optionally) the application driver itself, run as Nomad tasks in a Nomad job.

Figure: Spark Nomad Diagram

A user can submit a Spark application in the usual way. In this example, the spark-submit command is used to run the SparkPi sample application against Nomad in cluster mode:

$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master nomad \
    --deploy-mode cluster \
    --conf spark.nomad.sparkDistribution= \
    100
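In general, spark-submit takes the application JAR as its first positional argument, followed by the application's own arguments; for SparkPi the single argument (100 here) is the number of partitions to sample over. The following is a minimal sketch of assembling that command from its parts. The function name build_sparkpi_cmd and the example.com URLs in the usage note are hypothetical placeholders, not values from this post, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: print (do not run) a spark-submit invocation for Nomad cluster mode.
# dist_url and app_jar are caller-supplied, hypothetical locations; this post
# does not state their values.

build_sparkpi_cmd() {
  dist_url="$1"   # location of the Nomad-enabled Spark distribution
  app_jar="$2"    # application JAR that contains SparkPi
  slices="$3"     # SparkPi argument: number of partitions to sample over

  # Emit the command on one line so it can be inspected or logged first.
  printf 'spark-submit --class org.apache.spark.examples.SparkPi --master nomad --deploy-mode cluster --conf spark.nomad.sparkDistribution=%s %s %s\n' \
    "$dist_url" "$app_jar" "$slices"
}
```

For example, build_sparkpi_cmd "https://example.com/spark.tgz" "https://example.com/app.jar" 100 prints the full command for review. In cluster mode the driver runs on the cluster, so both locations presumably need to be reachable from the Nomad client nodes.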

A user can customize the Nomad job that Spark creates by explicitly setting configuration properties (see above) or by using a custom template as a starting point:

job "template" {
  meta {
    "foo" = "bar"
  }

  group "executor-group-name" {
    task "executor-task-name" {
      meta {
        "spark.nomad.role" = "executor"
      }

      env {
        "BAZ" = "something"
      }
    }
  }
}

Job templates can be used to add metadata or constraints, set environment variables, add sidecar tasks, and use the Consul and Vault integrations.
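As a rough sketch of how a template file might be handed to Spark, the helper below checks that the file exists and prints a spark-submit command that references it. This rests on an assumption: that the integration reads the template path from a configuration property named spark.nomad.job.template. Confirm the property name and the expected template format in the integration guide for your build. The function name template_submit_cmd is hypothetical.

```shell
#!/bin/sh
# Sketch: print (do not run) a spark-submit command that points at a job template.
# ASSUMPTION: spark.nomad.job.template is the property that carries the
# template path; verify this against the integration guide before relying on it.

template_submit_cmd() {
  template="$1"   # path to the job template file
  if [ ! -f "$template" ]; then
    echo "template not found: $template" >&2
    return 1
  fi
  shift           # remaining arguments are passed through unchanged

  printf 'spark-submit --master nomad --deploy-mode cluster --conf spark.nomad.job.template=%s %s\n' \
    "$template" "$*"
}
```

Calling template_submit_cmd with a template path followed by the usual --class, JAR, and application arguments prints the command for review. Failing early on a missing file avoids submitting a job that silently falls back to the defaults.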

The Nomad/Spark integration also supports fine-grained resource allocation, HDFS, and continuous monitoring of application output.

Getting Started

Our official Apache Spark Integration Guide is the best way to get started. You can also use Nomad's example Terraform configuration and embedded Spark quickstart to give the integration a test drive on AWS. Nomad-enabled builds are currently available for Spark 2.1.0 and 2.1.1.
