Apache Spark is a popular data processing engine/framework that has been architected to use third-party schedulers. The schedulers that are available, however, involve a level of complexity that can be undesirable for many potential Spark users. To help fill this gap, we are pleased to announce that the HashiCorp Nomad ecosystem now includes a version of Apache Spark that natively integrates Nomad as a Spark cluster manager and scheduler.
Nomad's design (inspired by Google's Borg and Omega) has enabled a set of features that make it well-suited to run analytical applications. Particularly relevant is its native support for batch workloads and parallelized, high-throughput scheduling (Nomad's documentation on scheduler internals covers this in more detail). Nomad is also easy to set up and use, which has the potential to ease the learning curve and operational burden for Spark users. Key ease-of-use related features include:
Nomad also integrates seamlessly with HashiCorp Consul and HashiCorp Vault for service discovery, runtime configuration, and secrets management.
When running on Nomad, the Spark executors that run tasks for your application, and (optionally) the application driver itself, run as Nomad tasks in a Nomad job.
A user can submit a Spark application in the usual way. In this example, the spark-submit command is used to run the SparkPi sample application against Nomad in cluster mode:
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master nomad \
    --deploy-mode cluster \
    --conf spark.nomad.sparkDistribution=http://example.com/spark.tgz \
    http://example.com/spark-examples.jar 100
A user can customize the Nomad job that Spark creates by explicitly setting configuration properties (see above) or by using a custom template as a starting point:
job "template" {
  meta {
    "foo" = "bar"
  }
  group "executor-group-name" {
    task "executor-task-name" {
      meta {
        "spark.nomad.role" = "executor"
      }
      env {
        "BAZ" = "something"
      }
    }
  }
}
Job templates can be used to add metadata or constraints, set environment variables, add sidecar tasks, and use the Consul and Vault integrations.
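As a rough sketch of how such a template might be passed to Spark: the example below assumes the integration reads the template path from a `spark.nomad.job.template` property and expects the file in JSON form, so the HCL template would first be converted (for example with `nomad run -output`). The property name, the conversion step, and the file names `template.nomad` and `template.json` are assumptions to check against the integration guide for your Nomad and Spark versions. The script only prints the command it would run.

```shell
#!/bin/sh
# Sketch: print (not run) a spark-submit command that uses a job template.
# ASSUMPTIONS: spark.nomad.job.template is the template property and expects
# JSON; template.json is a hypothetical file produced beforehand, e.g. with
#   nomad run -output template.nomad > template.json

TEMPLATE_JSON="${TEMPLATE_JSON:-template.json}"  # hypothetical path

template_submit_cmd() {
  # Build the argument list in the function's positional parameters.
  set -- spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master nomad \
    --deploy-mode cluster \
    --conf "spark.nomad.sparkDistribution=http://example.com/spark.tgz" \
    --conf "spark.nomad.job.template=${TEMPLATE_JSON}" \
    http://example.com/spark-examples.jar 100
  # Join the arguments with single spaces for display.
  printf '%s\n' "$*"
}

template_submit_cmd
```

Once the printed command looks right, it can be copied and run against a real cluster, with the distribution and jar URLs replaced by your own.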
The Nomad/Spark integration also supports fine-grained resource allocation, HDFS, and continuous monitoring of application output.
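For instance, executor sizing can be expressed with Spark's standard `spark.executor.instances`, `spark.executor.memory`, and `spark.executor.cores` properties. The sketch below assumes the Nomad integration translates these into resource requests on the executor tasks; the helper name `sized_submit_cmd` and the values passed to it are illustrative only, and the script prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: print (not run) a spark-submit command with explicit executor sizing.
# ASSUMPTION: the Nomad integration maps these standard Spark properties onto
# the resources of the executor tasks; the values used below are illustrative.

sized_submit_cmd() {
  # Capture the caller's arguments before reusing the positional parameters.
  instances="$1"
  memory="$2"
  cores="$3"
  set -- spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master nomad \
    --deploy-mode cluster \
    --conf "spark.nomad.sparkDistribution=http://example.com/spark.tgz" \
    --conf "spark.executor.instances=${instances}" \
    --conf "spark.executor.memory=${memory}" \
    --conf "spark.executor.cores=${cores}" \
    http://example.com/spark-examples.jar 100
  printf '%s\n' "$*"
}

# Example: four executors, each with 2 GB of memory and two cores.
sized_submit_cmd 4 2g 2
```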
Our official Apache Spark Integration Guide is the best way to get started. You can also use Nomad's example Terraform configuration and embedded Spark quickstart to give the integration a test drive on AWS. Nomad-enabled builds are currently available for Spark 2.1.0 and 2.1.1.