Apache Spark is a popular data processing engine/framework that has been architected to use third-party schedulers. The schedulers that are available, however, involve a level of complexity that can be undesirable for many potential Spark users. To help fill this gap, we are pleased to announce that the HashiCorp Nomad ecosystem now includes a version of Apache Spark that natively integrates Nomad as a Spark cluster manager and scheduler.
Nomad's design (inspired by Google's Borg and Omega) enables a set of features that make it well suited to running analytical applications. Particularly relevant are its native support for batch workloads and its parallelized, high-throughput scheduling (more on Nomad's scheduler internals here). Nomad is also easy to set up and use, which can ease the learning curve and operational burden for Spark users. Key ease-of-use related features include:
Nomad also integrates seamlessly with HashiCorp Consul and HashiCorp Vault for service discovery, runtime configuration, and secrets management.
When running on Nomad, the Spark executors that run tasks for your application, and (optionally) the application driver itself, run as Nomad tasks in a Nomad job.
A user can submit a Spark application in the usual way. In this example, the spark-submit command is used to run the SparkPi sample application against Nomad in cluster mode:
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master nomad \
    --deploy-mode cluster \
    --conf spark.nomad.sparkDistribution=http://example.com/spark.tgz \
    http://example.com/spark-examples.jar 100
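Whether the application driver runs inside the Nomad job depends on the deploy mode. As a rough sketch, the wrapper below prints the spark-submit command line from the example above for either mode, so it can be reviewed before it is run. The function name `sparkpi_cmd` is hypothetical, the URLs are the placeholders from the example, and the sketch assumes the usual Spark behavior in which `client` mode keeps the driver on the submitting machine while `cluster` mode runs it in the cluster.

```shell
# Hypothetical helper: print the spark-submit command for a given deploy mode.
# "cluster" runs the driver in the cluster; "client" keeps it on this machine.
sparkpi_cmd() {
  mode="${1:-cluster}"
  case "$mode" in
    cluster|client) ;;
    *) echo "unknown deploy mode: $mode" >&2; return 1 ;;
  esac
  # printf repeats the '%s ' format for every argument, joining them with spaces.
  printf '%s ' spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master nomad \
    --deploy-mode "$mode" \
    --conf spark.nomad.sparkDistribution=http://example.com/spark.tgz \
    http://example.com/spark-examples.jar 100
  echo
}

# Print the client-mode variant of the command for review.
sparkpi_cmd client
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without Spark installed; piping the output to `sh` would submit the application.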
A user can customize the Nomad job that Spark creates by explicitly setting configuration properties (see above) or by using a custom template as a starting point:
job "template" {
  meta {
    "foo" = "bar"
  }
  group "executor-group-name" {
    task "executor-task-name" {
      meta {
        "spark.nomad.role" = "executor"
      }
      env {
        "BAZ" = "something"
      }
    }
  }
}
Job templates can be used to add metadata or constraints, set environment variables, add sidecar tasks, and use the Consul and Vault integrations.
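To show how a template might be handed to Spark, here is a minimal sketch. It assumes the template path is passed through a Spark configuration property named `spark.nomad.job.template`; that property name is recalled from the integration guide rather than stated in this post, so verify it (and the expected template file format) there. The helper name `template_conf` is hypothetical.

```shell
# Hypothetical helper: emit the --conf argument that points Spark at a job template.
# ASSUMPTION: the property is named spark.nomad.job.template; check the integration guide.
template_conf() {
  template="$1"
  # Fail early if the template file is missing or unreadable.
  if [ ! -r "$template" ]; then
    echo "template not readable: $template" >&2
    return 1
  fi
  printf '%s\n' "--conf spark.nomad.job.template=$template"
}

# Example use (my-template-file is a placeholder path):
#   spark-submit --class org.apache.spark.examples.SparkPi \
#     --master nomad --deploy-mode cluster \
#     $(template_conf my-template-file) \
#     --conf spark.nomad.sparkDistribution=http://example.com/spark.tgz \
#     http://example.com/spark-examples.jar 100
```

Checking that the file is readable before submitting surfaces a wrong path locally instead of partway through job creation.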
The Nomad/Spark integration also supports fine-grained resource allocation, HDFS, and continuous monitoring of application output.
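As an illustration of resource settings, the sketch below assembles `--conf` flags from the standard Spark properties `spark.executor.instances`, `spark.executor.memory`, and `spark.executor.cores`. It assumes the Nomad integration honors these properties; the default values and the helper name `resource_confs` are illustrative, not recommendations.

```shell
# Illustrative sizing; override by exporting these variables before running.
EXECUTORS="${EXECUTORS:-4}"
EXECUTOR_MEMORY="${EXECUTOR_MEMORY:-2g}"
EXECUTOR_CORES="${EXECUTOR_CORES:-2}"

# Hypothetical helper: print one --conf flag per line for the executor resources.
resource_confs() {
  printf '%s\n' \
    "--conf spark.executor.instances=${EXECUTORS}" \
    "--conf spark.executor.memory=${EXECUTOR_MEMORY}" \
    "--conf spark.executor.cores=${EXECUTOR_CORES}"
}

# Show the flags that would be added to a spark-submit invocation.
resource_confs
```

The printed flags can be appended to the spark-submit command shown earlier, ahead of the application jar.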
Our official Apache Spark Integration Guide is the best way to get started. You can also use Nomad's example Terraform configuration and embedded Spark quickstart to give the integration a test drive on AWS. Nomad-enabled builds are currently available for Spark 2.1.0 and 2.1.1.