Case Study

How Jet.com uses HashiCorp Nomad on Azure to run its applications

Mohit Arora, senior director of engineering at Jet.com, shares how the company schedules workloads with HashiCorp Nomad.

Jet.com is changing the way the world shops with innovative technology and a people-centric approach. We move at the speed of a startup with access to unrivaled resources.

In 2016, we evaluated cluster schedulers to increase utilization and reduce costs. After comparing several, we chose HashiCorp Nomad for four reasons:

  • It's cross-platform, supporting both Windows and Linux.
  • It's flexible enough to run some workloads in containers and others outside them.
  • It's easy to use, because we already had experience running Consul.
  • It integrates with Consul and Vault.

In this talk, Mohit Arora, senior director of engineering at Jet.com, discusses how the company, now a Walmart subsidiary, uses Nomad, including how his team made the implementation a success.

Transcript

Good morning, everyone. My name is Mohit, and I'm part of the Cloud Platform Team at Jet. For those of you who are not familiar with Jet, it is an e-commerce company based out of New Jersey. We are part of the Walmart family now; Jet was acquired by Walmart late last year. Our value proposition for customers is that prices drop as you shop. As you build a bigger basket and add more items to your shopping cart, our system computes, in real time, the savings from the supply chain efficiencies that can be brought in to fulfill your order, and we pass those savings back to the customer. A quick example: say you buy a shirt at Jet for $25. Normally, on other e-commerce platforms, a second shirt would also be $25. But at Jet, if it comes from the same merchant and can be shipped in the same box, the second shirt might be $20. Those are the supply chain efficiencies we are talking about, computed in real time.

Jet was born in the cloud in 2014. All the systems that power jet.com run in the public cloud, which is Microsoft Azure in our case. Our journey with cluster schedulers started late last year. We were trying to bring the utilization of our platform up and drive the cost down, so we compared the schedulers that are out there in the market: Nomad and several others. All these schedulers promise you similar things, but Nomad won for us primarily because of these four things.

First is cross-platform support. Our applications are built in F# and our cloud platform is written in Golang, and our F# services run on Windows in production. We wanted a scheduler that can support both Windows and Linux, and Nomad was the clear winner for us.

The second is flexibility. Even though we were big on containers, we had legitimate use cases where we wanted to run a workload as a general-purpose workload, not packaged as a container. Nomad gives us the flexibility to run either a general-purpose workload or a container.

The third is ease of use. Before Nomad, we were heavily invested in Consul, and Nomad and Consul have the same semantics. We already knew how to run Consul in production, so when we tried Nomad out, it was pretty easy to get started with. That also played a factor in our decision.

The last is the fantastic Consul and Vault integration. As I said, we had a long list of use cases that we drive out of Consul, and Nomad has out-of-the-box integration with both Consul and Vault. That was also a factor when we were making a decision.

After we picked Nomad, we built an ecosystem around it. We never exposed Nomad directly to developers. Instead, we built a tool in front of Nomad, which we call Gizmo, and all we ask developers to do is check a couple of manifest files into their version control system. One is the service manifest, which describes what your service is and what its health checks are. The other is the deployment manifest, which describes how the service is going to be deployed, whether it needs to be co-located with other services, and so on. The rest of the magic is done by Gizmo: it takes the manifests and ultimately creates a Nomad job file that it deploys on the Nomad cluster.
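
To make the manifest-to-job step concrete, here is a minimal sketch in Python. It is not Gizmo's actual code or manifest schema, neither of which is described in the talk: the manifest keys (`name`, `health_check`, `driver`, `config`, `datacenters`, `count`) and the function `build_nomad_job` are hypothetical. The output is loosely modeled on Nomad's JSON job format, so the field names should be checked against the Nomad documentation before use.

```python
import json

# Nomad's JSON job format expresses durations in nanoseconds.
NANOS_PER_SECOND = 1_000_000_000


def build_nomad_job(service, deployment):
    """Merge a hypothetical service manifest and deployment manifest into one job."""
    check = service["health_check"]
    task = {
        "Name": service["name"],
        # "docker" for a container, or e.g. "raw_exec" for a plain executable.
        "Driver": deployment["driver"],
        "Config": deployment["config"],
        "Services": [{
            "Name": service["name"],
            "Checks": [{
                "Type": "http",
                "Path": check["path"],
                "Interval": check["interval_seconds"] * NANOS_PER_SECOND,
                "Timeout": check["timeout_seconds"] * NANOS_PER_SECOND,
            }],
        }],
    }
    return {
        "Job": {
            "ID": service["name"],
            "Name": service["name"],
            "Type": "service",
            "Datacenters": deployment["datacenters"],
            "TaskGroups": [{
                "Name": service["name"],
                "Count": deployment["count"],
                "Tasks": [task],
            }],
        }
    }


# Hypothetical manifests, as a developer might check them into version control.
service_manifest = {
    "name": "cart-api",
    "health_check": {"path": "/health", "interval_seconds": 10, "timeout_seconds": 2},
}
deployment_manifest = {
    "driver": "docker",
    "config": {"image": "example/cart-api:1.0"},
    "datacenters": ["dc1"],
    "count": 3,
}

job = build_nomad_job(service_manifest, deployment_manifest)
print(json.dumps(job, indent=2))
```

Keeping the two manifests separate mirrors the split described above: the service manifest says what the service is, and the deployment manifest says how it runs.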

When we took Nomad as a platform, added this tooling on top, and exposed it to developers, we saw immediate benefits.

The first is the continuous delivery story. Developers love the fact that they can define how their services need to be deployed in the version control system, and the rest is taken care of for them by Gizmo. To make a change, they go back to version control, kick off the next run of the build, and it updates the configuration for them.

The second important win for the app dev teams was the geo-redundancy story. At Jet, we run services in multiple modes, which we call hot-hot, hot-warm, or hot-cold. If you are writing a service, it has to opt for one of these models depending on what it does, and that defines how it will be deployed across the multiple regions we run out of. Now, developers can simply declare in the manifest file which mode they want to run their service in, and everything else is taken care of for them. They love this.
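
As a rough illustration of how a declared mode could translate into per-region instance counts, here is a small Python sketch. The talk does not define the three modes precisely, so the interpretation below is an assumption based on common usage: hot-hot runs a full set of instances in every region, hot-warm keeps a reduced standby in secondary regions, and hot-cold runs nothing in secondary regions until failover. The function `plan_regions`, the region names, and the warm standby size are all hypothetical.

```python
# Assumption: a warm region keeps a single instance running as a standby.
WARM_STANDBY_COUNT = 1


def plan_regions(mode, regions, count):
    """Return {region: instance_count}; the first region is treated as primary."""
    if not regions:
        raise ValueError("at least one region is required")
    primary, *secondaries = regions

    if mode == "hot-hot":
        standby = count  # every region serves traffic at full size
    elif mode == "hot-warm":
        standby = min(WARM_STANDBY_COUNT, count)  # small standby, ready to scale
    elif mode == "hot-cold":
        standby = 0  # nothing runs in secondaries until failover
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    plan = {primary: count}
    for region in secondaries:
        plan[region] = standby
    return plan


for mode in ("hot-hot", "hot-warm", "hot-cold"):
    print(mode, plan_regions(mode, ["region-a", "region-b"], count=4))
```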

The third is scaling. Before Nomad, app dev teams that worried about scaling had to worry about the compute infrastructure as well. With this offering, we have a separate process that keeps monitoring the Nomad infrastructure and scales it up when needed. Developers pretty much have the guarantee that if they need to scale their service up, it will be a quick effort, because the infrastructure is already scaled up for them and they don't have to worry about it.
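
The decision such a monitoring process has to make can be sketched as a headroom calculation. This is an illustrative Python sketch only: the talk does not say which metric or threshold Jet's process uses, so the CPU-based rule, the target utilization, and the function name `nodes_to_add` are assumptions.

```python
import math


def nodes_to_add(allocated_mhz, total_mhz, node_mhz, target_utilization):
    """Nodes needed so that allocated CPU is at most target_utilization of capacity."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Capacity the cluster should have to keep the desired headroom.
    desired_total = allocated_mhz / target_utilization
    deficit = desired_total - total_mhz
    if deficit <= 0:
        return 0  # enough headroom already; nothing to add
    return math.ceil(deficit / node_mhz)


# 9,000 MHz allocated on a 10,000 MHz cluster, 2,000 MHz per node, 75% target.
print(nodes_to_add(9000, 10000, 2000, 0.75))
```

Scaling the cluster ahead of demand, rather than when a service asks for more instances, is what lets a service scale-up stay quick.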

The next is operational excellence. As I mentioned, developers can define health checks in the service manifest file. We also let them define a PagerDuty service key. The ecosystem we have built keeps monitoring their services on their behalf, and because we have a handle on that PagerDuty key, we can page them if a service is crashing. They love the fact that they don't have to do anything; everything gets taken care of by the ecosystem that's built on top of Nomad.
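
A minimal Python sketch of the "monitor, then page" flow follows. It is an assumption-laden illustration, not Jet's implementation: the three-failure threshold and the helper names `should_page` and `build_page` are hypothetical, and the payload is only shaped like a PagerDuty Events API v2 trigger event, so the fields should be verified against PagerDuty's documentation. The sketch builds the event but does not send it.

```python
def should_page(recent_checks, failure_threshold=3):
    """True when the most recent `failure_threshold` health checks all failed.

    `recent_checks` is a list of booleans, oldest first (True means healthy).
    """
    if len(recent_checks) < failure_threshold:
        return False
    return not any(recent_checks[-failure_threshold:])


def build_page(service_name, pagerduty_key):
    """Build (but do not send) a trigger event for the team's PagerDuty key."""
    return {
        "routing_key": pagerduty_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"{service_name} is failing its health checks",
            "source": service_name,
            "severity": "critical",
        },
    }


history = [True, False, False, False]
if should_page(history):
    # "EXAMPLE_KEY" is a placeholder for the key taken from the service manifest.
    print(build_page("cart-api", "EXAMPLE_KEY"))
```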

The last is canary deployments and experiments, which our business users love. Like any other e-commerce company, we run a lot of experiments at Jet, and we try to figure out the conversion impact of each experiment so that we know which experiments to run longer.

Nomad allows us to run multiple versions of an app on the platform. We have built a tool on top of it, which we call Phaser, that allows you to phase traffic from one version to another. We also do some things at the proxy layer with headers, so business users can correlate the conversion impact of each version that's running in production. That's a huge win again.
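
One common way to phase traffic between two versions is to hash a stable identifier into a fixed set of buckets and send the lowest buckets to the new version. The Python sketch below shows that general technique; it is not a description of how Phaser works internally, which the talk does not cover. The function `pick_version`, the session IDs, and the version labels are hypothetical.

```python
import hashlib


def pick_version(session_id, new_version_percent, old="v1", new="v2"):
    """Route a session to `new` for roughly new_version_percent of sessions."""
    if not 0 <= new_version_percent <= 100:
        raise ValueError("new_version_percent must be between 0 and 100")
    # Hash the session ID so the same session always lands in the same bucket.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return new if bucket < new_version_percent else old


# Phase half of the traffic onto the new version.
sessions = [f"session-{i}" for i in range(10_000)]
on_new = sum(pick_version(s, 50) == "v2" for s in sessions)
print(f"{on_new / len(sessions):.1%} of sessions routed to v2")
```

Because each session's bucket is fixed, raising the percentage only moves sessions from the old version to the new one and never back, which keeps per-version conversion numbers easier to compare.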

This is what our production cluster looks like. I know it's not a huge cluster by any means, but where we are right now is that we have proven the platform and the ecosystem we have built around it, and it's a completely self-service model. At this point, there's a bigger push within the company to move all services that are not data stores onto Nomad, and the idea is to run everything except data stores on top of Nomad before the holidays. By early November, I expect this number to go up to close to four hundred nodes.

As we speak, we have already migrated the front end, which is an old app for jet.com, to Nomad, and we have phased fifty percent of our traffic onto that version of the application. Before the holidays we plan to go to one hundred percent, and all other services, even those not serving customers directly, will also be running on top of Nomad.

That's it. Do check us out during the holidays. We'll have some great deals going on, and the platform will be running as well. Thank you.
