Backend batch processing at scale with Nomad: A GrayMeta case study
How did GrayMeta move an application from a traditional model of processing jobs out of a queue—on multiple VMs—to scheduling them as container jobs in HashiCorp Nomad?
Graymeta indexes large filesystems by extracting metadata and leveraging machine learning APIs to process files. Wanting to benefit from the security of running their indexing process in a container, reduce their code’s complexity that was scheduling jobs to run on specific nodes, consolidate infrastructure by treating all back-end batch processing machines as a pool of resources to scale up and down as a single unit, and several other reasons, they began looking for an easy to operate container orchestration layer. For them, it has to be dead simple to operate as they have to be able to run it in the cloud and behind corporate firewalls in on-premises deployments. They chose Nomad as it was the best fit for their requirements.
In this talk, Jason Hancock, Operations Engineer at Graymeta, will dive into the how of moving their application from the traditional processing jobs out of a queue on multiple VMs to processing the same jobs out of the same queue as before, but scheduling them as container jobs in Nomad. It will also touch on their local development environments that each developer runs on their laptop where they run Nomad inside a Docker container and use the raw_exec driver to mimic what we’re doing in production. As a final point, Hancock will discuss the increased portability and flexibility that moving to Nomad has afforded them, specifically in terms of being able to abstract the container scheduling and use a variety of different schedulers in different environments (for example, ECS in EC2).
Software and Systems Engineer, GrayMeta
Hello everyone. As mentioned, I'm Jason Hancock, I'm an Operations Engineer with GrayMeta. What we do, our application, we go out to large file stores, if something's in S3 or Azure, we'll go and we'll process all of the content in your bucket. We'll analyze the content. If it's video content we might break it down frame by frame, do facial recognition, speech to text, things like that. Extract the technical metadata from the documents and then we'll stick those into our data store and make it browsable and searchable.
Our workload that we throw at Nomad, it's a lot of small back-end batch jobs. Essentially you get one job per file. We do a little bit of optimizations when we hit a lot of small files. So we'll batch several of them up together into one job. But you can think of it, essentially, as one job per file.
Some jobs are small and fast, others can take multiple hours to complete. You can imagine processing a 20 kilobyte jpeg might take a few seconds versus processing a three hour long 4K resolution video of something like the presidential debates. That could take several hours to complete. I want to tell a little story, how we decided to pick Nomad.
In our original architecture, a user would come in either through our user interface or through the API and say hey I've got this storage sitting out here and S3 or Azure, and they would initiate what we call ... They would tell our app to start walking that file system. The walker would then drop files into a traditional message queue. We'd queue up all these jobs and then we'd have a back-end worker start popping jobs out of the queue and start processing them. It would then dump the results into a results queue and then we had a little persistence process that was popping the results out of the results queue and sticking them into our database through our API.
To scale this original architecture, essentially we spun up multiple back end workers across a fleet of servers. It scaled horizontally pretty well but there were some rough edges. Some customers have really really big files that we had to process and our scheduling algorithm was pretty dumb. At best it was dump the job into a queue and hope for the best. To counter our lack of intelligent binpacking we essentially over provisioned all of our nodes on disk. One terabyte SSDs are expensive. Because we couldn't really control which node would end up with the job to process a large file. And our CFO wasn't very pleased with that. Especially when they realized that we're provisioning one terabyte discs and we might be using 20 or 30 gigabytes of that at any one time, except for the case where we had to process one or two large files.
Other problems at the time we didn't really have a response for things like the image tragic bug. Our code we use a lot third party executables to mine data from these files. If someone were to harvest a file that had a malicious payload in it, we didn't really have a way to counter that. At the time, all of our processing nodes had a lot of sensitive information on it. They would connect directly to the client storage to process, to be able to download the files. So that was pretty bad. Some other problems we had were operability related.
We could easily add nodes to our cluster, we could just spin up more nodes and they would start reading from the queue and process files. But we didn't really have a way to tell a node to stop accepting new jobs while we let the existing jobs finish. And that would be pretty easy to build in except we had all these other problems as well. So we wanted a way that we didn't have to invent to be able to drain jobs from nodes. The final problem, our biggest problem, really, was multitenancy. So our application, we run it as a SaSS most of the time, but we also can run on-prem. But for our SaSS model, our infrastructure, we were essentially deploying an entire copy of the infrastructure per client. So this meant that each client had a back end processing cluster. Unless they were actively processing files, a lot of times they had servers sitting around idle. So this wasn't good. It also meant that I had to scale 30 separate clusters in response to load instead of scaling one up and down.
So we wanted a way to share these back-end processing nodes across customers. We started thinking about how we wanted to solve some of these warps, or some of these problems. We settled on trying to do the processing inside a container, because it would provide an extra layer of security. We also decided that the container was only going to get credentials to the absolute minimum amount of things that it needed. We were hoping that whatever solution we decided would try to resolve a lot of the pinpoints of our current architecture. At the same time, we wanted to avoid writing our own scheduling and binpacking algorithms. We didn't want to try and invent those here. And we were hoping that whatever solution we settled on was going to be simple.
So we started looking around at container orchestraters. For our solution he had to be really simple because we deploy ... We try to co-locate or compute wherever our customer's storage is. So if that's an Amazon, we'll deploy an Amazon, if it's in Azure, we'll deploy there. If it's on-prem on an NFS store, we'll deploy on-prem. A lot of times, at least in the on-prem scenario, I might not be the person operating it. It'll be a customer operating it on our behalf. It had to be simple enough for them to be able to operate. It has to be easy to operate. Has to have an easy to use API. It needs to be able to schedule based on multiple dimensions, CPU memory and for us, an important one was disc. We were hoping it would be easy to integrate with some of our existing infrastructure components like our centralized logging system and our metrics.
The only one that really fit the bill initially was Nomad. Nomad solved a few of the problems immediately out of the box. Being that Nomads are going to go and compiles down to a static binary, it's super easy to deploy. It's also very easy, we found, to operate. Out of the box you're able to scale notes up and when you need to remove nodes from the cluster it's really easy to flip on drain mode and start pulling them out of rotation. It solves the bin packing problem for us. Now instead of having to over deploy all of our nodes, we can only over deploy one node, or over provision one node. Nomad will figure out when that large file is starting to get processed, where to stick it in the cluster for us.
On the easy to use API front, the go client and also the Nomad CLI is really easy to use. So this is a dummy block of code that would set up a job to launch a container. And it's really easy to actually launch. This is assuming you have Nomad running on local host without TLS or anything special there. Another benefit that we found to Nomad, being that it's written in Go, and we happen to be a Go shop as well, is that if the docs weren't clear, we had a full client implementation to go look out. We could go reverse engineer the Nomad CLI and borrow from it. When we were considering integrating with other parts of our infrastructure, the one I mentioned earlier was our centralized logging system, we found it really easy. We didn't have to setup some complicated, docker logging solution. For us, it ended up being just adding a wild card path into our file configuration and it picks up the standard out and standard error streams from all of our containers and ships those automatically over to our elk stack.
We also found it really nice that Nomad has telemetry built into it and it was really easy to turn it on and point it at our stat site server. So we could get all of our metrics into graphite and graph and see what's going on under the hood in Nomad. We haven't really had any problems with Nomad that would warrant us analyzing all the metrics under the microscope. But it's nice to know that they're there if we needed to. So now that we had a Nomad cluster up and running and we can launch jobs on it, we had to re-architect our application to make use of Nomad. The first thing we decided early on was that the container themselves wouldn't directly access the storage. So we were going to use our API to proxy access to the files. The authentication that the containers were going to use to access the files through the API would be limited in scope.
You also see that instead of the backend worker reading directly from the files to process queue, that we have a scheduler process reading now, reading from it instead. Really what the scheduler does, all it does is it grabs a job out of the queue and it's responsible for scheduling a job in Nomad. It's really light weight. We didn't have to put a whole lot of logic into it. We just fire the job at Nomad and wait for Nomad to handle running the actual job. Once the job is running, the back end worker talks to the API, downloads the file, processes it, ships the results back. Instead of dropping the results back into a queue, this time we modified our API to accept the results directly from the back end worker and those end up in our database.
When the container is actually accessing our API, we're using a narrowly scoped olath token so that the only piece of information that the container can receive, or the only pieces of information that the container can actually have access to are those that are relevant to the file that it's harvesting. That means that it can only download the one specific file that we told it to and it can only update the results in the database for that particular result, for that particular file. The other big problem that we had that Nomad solved for us was the multitenency. We needed a way to be able to run multiple customers jobs on the same Nomad cluster. To do that we built a generic container image and then we inject all of our configuration via environment variables that Nomad injects into the container run time as it's running.
This includes URLs or APIN points. The off tokens, any specific configuration or keys that it would need to process the files. A lot of our problems were solved by moving to this new architecture. I wouldn't call security solved. It's just more isolation, more layers. It solved our maintenance issue so now we could scale up and down our clusters easier. I only had to manage a single cluster or a couple of clusters instead of 30 or 40 clusters. We can now deploy heterogenous nodes. Each node was more realistically provisioned to the resources that it would actually use. When the processors were running we received better utilization of those resources. Instead of a one terabyte disc and we're only using 20 or 30 gigs, now we're using 20 or 30 gigs of a 50 gig disc. After we launched this and started running it in production, some of our customers came to use and said, we don't necessarily want to be on the hook for managing a Nomad cluster. Is there something you can do since we're hosting an AWS. It would be nice if we could run it on an ECS cluster.
Since we're a Go shop and go has this concept of interfaces, we were basically able to abstract our scheduler back-end so that we had a Nomad interface and an ECS interface. That flexibility was really awesome for us. It means we can basically plug into any of the other container run times that we want or that our customers require. It makes it really easy to swap one for the other based on the customer's needs. For most cases we're still deploying Nomad.
One of the other requirements that we didn't realize at the time when we were sitting down to pick something was our development environments. We needed something simple and easy to operate and something that we wouldn't really have to do any mocking or faking of in our dev environment. We wanted to be able to run a full dev environment stack locally. Something that resembled production as closely as possible. For our dev environment we use a monolithic docker image that contains all of the back end components and pieces. Inside that container image we're actually running the same version of Nomad in Consul as we do in production. The only real difference there is that instead of using the docker driver we're using the raw exec driver.
The only time that it's not the case that we're running a different version then what's in production would be when we're actually testing new versions of Nomad and Consul. We actually test them first in our local dev environments before we start testing on our test infrastructure. The reason we use the raw exec driver in our dev environment is it avoids us having to do something creative where we're running something like Docker. Our dev team is really, they're dispersed around the globe. So a lot of times if somebody's having problems at their 8am, it's my midnight because there's folks all over the world. So it had to be simple enough that they could operate without my help or the help of my team because we're all based on the west coast.
Another reason to use raw exec for us was every time that we recompile our processing binary, we didn't want to have to rebundle that into a container image or do something creative where we mounted the binary into the container. Doing all of this allows our developers to iterate rapidly without faking anything. They consume the Nomad API the same way we do in production. The only difference is the docker driver or the raw exec driver. In summary, after re-architecting our platform to take advantage of Nomad and processing these files and containers we discovered additional flexibility. We had better security, we didn't have to implement a scheduler or a binpacker at all. We're able to better utilize compute resources, reduce overall cost, reduced operational overhead and it's a win win all around. We found it really easy to operate in production and I really have no complaints about Nomad. It's been a great experience using it.
That's our story of how we moved our back end batch processing into Nomad. I know I probably went through that a bit quickly. I'm a bit nervous, I don't do much speaking.
We've got good time for questions if anyone has any question for Jason. And I will go grab the mic.
For your workers, did you every consider just spinning up a new instance depending on the file size you were working with instead of having these instances constantly running and taking up resources?
We did consider provisioning things on the fly. Spinning up and down VMs or adding capacity dynamically, building some sort of autoscaler. The big problem there was that the response time. As soon as the messages in the queue our customers were demanding that they initially got some feedback that it had started processing. We could fake that while we spun up a node in the background. But really for our requirements, they wanted to have at least one node always up, always processing so that they could at least see some activity. They were fine with us walking a large file system and they understood that we might have to queue up a thousand files and run them serially through the system if they were only willing to provide one compute note or something like that. But they wanted to have one always up and always available.
Hi, I was just curious, how do you deal with one customer blowing out the rest of the nodes or just overwhelming it with a number of files or is there a way to segment down on customer ID and things?
There is. We actually throttle the number of jobs that any one customer can schedule on the cluster at a given time. We throttle it, we say that there can be up to a hundred or two hundred pending jobs. And we have monitoring setup so that once those conditions get hit we'll see it and we can react and scale the clusters up and down. Eventually we're going to replace all of that with automated processes as well.
Hey there, I'm just curious if you did anything to queue up training or if you just use the train function without anything special.
We're just using the train function without anything special. Most of the time, since we're not, our scaling isn't being done in an automated fashion right now, it's all being done manually, we just look at the cluster, pick the node that has the fewest number of jobs, put that in train mode and then terminate it once the job's finished running.
What advantages did giving your developers a monolithic container give you over doing them an image where you could more accurately represent what's happening in production?
Initially we started this by giving them, using vagrant. Unfortunately at the time it proved to be a little bit too unstable for the amount of stuff we were trying to run on a single instance. We're running things like elastic search and Redis and NSQD and Go and all these things. There was too much delta from developer machine to developer machine. It was becoming a maintenance nightmare. Also the time to provision was pretty long because they had to install all the packages from source. Not from source but install all the packages every time it came up. With docker we can have our CI server spit out an image and then they can all just pull down and consume that image.