Nomad at Target: Scaling Microservices Across Public and Private Clouds
Nov 19, 2018
This talk will focus on how Target uses Nomad, Terraform, Consul, and Vault to continuously deploy, scale, secure, and manage their services—both in their data center and in public clouds.
On the Transact/Fulfill team at Target, they deploy hundreds of applications across multiple cloud providers. They deploy many different flavors of applications: microservices, Elasticsearch clusters, batch jobs, Java apps, Go apps, etc.
In this talk you will learn:
- Why Target chose Nomad
- How they automate their deployments using Terraform
- Their need for portable applications—and how they needed to support deployments to Kubernetes and VMs
- Their use of other components of the HashiCorp stack, including Consul and Vault, to support service discovery and secret management inside their Nomad cluster.
Lead engineer, Target
Principal Engineer, Target
Danny Parker: I’m Danny, this is Suresh. Welcome to our talk, “Nomad at Target: Scaling Our Microservices Across Public and Private Cloud.”
Just really quick, there is no official Q&A, so Suresh and I will both be around afterwards for questions, and we also put our Twitter handles here. Both of us will monitor our phones. If you have any questions or anything, we’re happy to respond. Whether it’s today or 6 months from now, if you have any questions, feel free to reach out.
Real quick, before we jump into everything, just a quick overview of what we’re going to talk about. Suresh and I are both going to introduce ourselves. I’ll give a little bit of background on Target and what Target’s been working on recently.
Some of you may not be familiar with Target, as it’s a US-only company right now, so we’ll talk about that. Then we’ll get into Nomad, talk about what problems we had, why we chose to use Nomad, how we architected it, things like that. And then at the end, talk about looking further in the future, how we wanna handle it, what we’re looking to do, maybe how we wanna expand. Some things like that.
Really quickly, again, I’m Danny. So I’m a principal engineer at Target. I work on the Guest Fulfillment Architecture team. Kind of a weird name, doesn’t really explain at all what I do.
That team, we’re in Target.com. If anyone here has used Target.com, if you’ve ever added an item to your cart, if you’ve ever placed an order on Target.com, all of the microservices and the platform underneath that, my team supports, automates, scales, etc. And so as you can imagine, the next month or so is gonna be very important for us because of Cyber Monday, Black Friday.
Previously I worked on the team called Enterprise Services, where we built some APIs at Target. I graduated from Purdue University in 2011, so, “Go, Boilers!” I apologize to any Ohio State fans that are here today.
Suresh Krishnan: Good afternoon. I’m Suresh Krishnan. I’m a lead engineer, working for the same team, Guest Fulfillment Architecture, with Danny.
I joined Target 2 years back. Prior to that, most of the time I was working in the Bay Area. I’m a graduate from Anna University in India.
Danny Parker: Really quickly about Target: We’re a US-only retailer. We’ve got about 1,800 physical stores. We also have Target.com, which, as you can imagine, has become very important in the last couple of years with the growth of e-commerce, with the amount of people that shop online now.
I did want to highlight some of the features that we’re working on, a) as a plug for them, because I think they’re really cool; but b) because features like this can be very hard to develop, and it’s a big challenge for our technical teams. And so I want to set the stage, as a lot of these features are the driver for why we look into tools like Nomad, or why we look into automation with tools like Kubernetes, Nomad, and Terraform.
These are a lot of those drivers that do that. One is our shipping from stores. If you order something on Target.com, we’ll deliver it to you from a store, same-day delivery. Target recently acquired a company called Shipt, working to do same-day delivery.
And then mobile apps. Like Target.com, our mobile apps have become much more popular, and there are many features that we want to add. We want to scale those out. That kind of stuff has become a big challenge for our technical teams.
» Challenges of the architecture
Target right now is in a place where we have some challenges. Target is a very big company, and so this is by no means true for every team at Target. But, overall, we have a couple different places where we can deploy and manage, and things are inconsistent.
The first is that Target.com is fully in Google Cloud. Super excited about that. We had Target represented at the Google Cloud Next conference in San Francisco this July. We gave some talks there, so if you’re interested in hearing about how we architected that and how we run different things, I encourage you to check out those talks, because they’re really interesting.
We also have our own data centers. Target has 2 data centers in Minnesota, and that’s like traditional legacy workloads, things that maybe didn’t move to the cloud, or some things that maybe haven’t been prioritized to move yet. And then the other exciting thing is that we’re deploying to our stores. So all of our stores run Kubernetes, and we are working on deploying and doing more and more automation of our store workloads, which is really exciting, but also a big challenge.
I want to talk about our specific team and the issue that we had. So that’s where Target is overall. We have stores, we have data centers, we have the cloud. Then there’s my team, and I want to talk about a specific application that our team has been working on and deploying. It just went into production last week, which is very exciting for us. But that application has a unique characteristic: We deploy it to all 3 of those locations. We deploy it to our stores, which are Kubernetes right now, so it has to be in Kubernetes. We also deploy it to our data centers, and we deploy it to Google Cloud.
I have question marks by those last 2 bullet points because we had to figure out where we wanted to run that. We had a lot of choices, we had some flexibility there, but we didn’t know what to do there. Because we have Kubernetes in the stores, and then we have the public cloud and our own data centers, and so we needed to decide what to do there.
We also had a bunch of supporting microservices, like batch jobs, a Spark job that does some stuff against our Cassandra cluster. We also had some PoCs and some UIs and different things that we needed to support as well. So it wasn’t just the application, but it was many different microservices in a big architecture.
» Keeping things simple
I wanted to touch quickly on some things that I wanted to avoid. This [slide] is probably hyperbole, but this [architecture diagram] was built by one of our engineers at Target. But we wanted to avoid extra complexity, we wanted to avoid having to build something that was totally outside the realm of what our developers currently use, and we wanted to reuse as much as we could.
So, what already existed? What could we have done, outside of Nomad? What could we have done that we decided against? One of the talks that we gave at Google Cloud was on Spinnaker, and we could have done RPM deployments with Spinnaker.
Now, there is a bit of a downside there. You have to build a new RPM, you have to bake that image, and you have to deploy it to the cloud. And then, based on if you’re doing blue-green or you’re doing canary deployments, that can take a decent amount of time. It can take maybe 30 minutes; it could take even longer. When you’re iterating really quickly and making a lot of changes and testing and stuff, it can be hard, and it can slow you down.
The second one is Kubernetes. This is the most common one that we get asked about: “Why didn’t you use Kubernetes?” The 2 main things that I say, the first is that, because of the specific environment that we’re in at Google and because of some of the policies that we have at Target, we would’ve had to run Kubernetes ourselves, and that’s very complex, and it’s not something that we wanted to take on.
And then 2 is that we already had an architecture in the cloud, we already had our load balancing and our other things all set up, and we wanted to keep as much of that and reuse as much of that as possible. Whereas bringing in Kubernetes would’ve meant that we change a lot of that stuff over or figure out how to reintegrate that stuff with Kubernetes.
And then Mesos Marathon was available; some teams use it at Target. It was pretty complex when we tested it out. It didn’t integrate well with some of the other tools.
And then, Chef. We still use Chef in our data centers, but it’s mostly legacy, and we didn’t want to move that to the public cloud.
So, finally, we get to Nomad. We chose Nomad for this solution. So to reiterate, we needed to run an application in our data centers and in our public cloud, and we chose to do that with Nomad.
» Why we chose Nomad
Let’s talk about some of the reasons why we chose to use Nomad. The first, and this is probably one of the more important ones to me, is consistency. We have to deploy this application in the stores already using Kubernetes, which means that it’s already been Dockerized. Being able to reuse that same binary was very important to me and to our team, because I didn’t want to have to build a new RPM or to have to ask our developers to have a different way to configure the binary that they’re using. I wanted to keep that same Docker image.
The speed of deployment was also very important. Our developers are constantly making changes; they’re constantly testing things out; they’re constantly tweaking flags and tweaking different things because of the heavy testing that’s going on for these new features. And so giving them that ability to deploy very quickly was very important to us.
Another big one was that it integrated directly with Consul and Vault. Like I said, we already had a lot of that architecture, in that we already had our secrets involved. Our architecture already relied heavily on Consul DNS and Consul for service discovery, and so being able to continue to reuse those was very important, and it worked very well.
Another thing was multi-region. Outside of our own data center, we have Google Cloud, so right now we’re in US Central and US East (hopefully US West to come soon). But the multi-region support was important in that we really wanted the ability to deploy independently to the regions, but also have that federation and have that shared config and the ability to fail over and all the other things the multi-region support gives you.
And then last was Terraform. I’m sure a lot of folks here use Terraform. We use it heavily to deploy. We already have a well-known deployment pipeline built on Terraform. It’s well understood by our developers. Being able to keep that basically the same and reuse that was very important.
With that I’m going to turn it over to Suresh for a bit. He’s going to get into more of the technical architecture and more of the details around how we deployed this and how we manage it.
Suresh Krishnan: Thanks, Danny, for setting the context.
» Automating the pipeline
When we decided to use Nomad for our internal use, we came up with a plan like, “OK, let’s automate it.” And so we started with the basic model, to empower the developers. We followed the Git process.
The developers can go to Git and alter the Nomad job, and then the moment they complete the Nomad job and the pull request is cleared, the CI/CD pipeline comes into the picture. We have Drone managing the CI/CD pipeline. The Terraform plan validates the Nomad job, and then once it’s good and committed, Terraform applies and pushes the Nomad job to the Nomad servers.
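A minimal sketch of that validate-then-apply gate, written in Python rather than Terraform. The job fields follow the JSON shape that Nomad's HTTP API accepts for jobs; the helper names (`build_job`, `validate`, `pipeline`), the validation rules, and the example image name are assumptions for illustration, not Target's pipeline. Terraform's Nomad provider does offer a `nomad_job` resource for this purpose, but the talk does not show its configuration.

```python
import json


def build_job(job_id, image, count, cpu_mhz, memory_mb, datacenters):
    """Build a service job in the JSON shape Nomad's HTTP API accepts."""
    return {
        "Job": {
            "ID": job_id,
            "Name": job_id,
            "Type": "service",
            "Datacenters": list(datacenters),
            "TaskGroups": [{
                "Name": "app",
                "Count": count,
                "Tasks": [{
                    "Name": job_id,
                    "Driver": "docker",
                    "Config": {"image": image},
                    "Resources": {"CPU": cpu_mhz, "MemoryMB": memory_mb},
                }],
            }],
        }
    }


def validate(job):
    """Stand-in for the 'plan' step: return a list of problems, empty if OK."""
    problems = []
    spec = job["Job"]
    if not spec["Datacenters"]:
        problems.append("no datacenters listed")
    for group in spec["TaskGroups"]:
        if group["Count"] < 1:
            problems.append(f"group {group['Name']}: count must be >= 1")
        for task in group["Tasks"]:
            # Hypothetical house rule: require an explicit image tag.
            if ":" not in task["Config"]["image"]:
                problems.append(f"task {task['Name']}: image has no explicit tag")
    return problems


def pipeline(job, submit):
    """Validate first; call submit (the 'apply' step) only when the plan is clean."""
    problems = validate(job)
    if problems:
        return problems
    submit(json.dumps(job))
    return []
```

Passing `submit` in as a function keeps the sketch runnable without a Nomad cluster; in a real pipeline that step would be whatever pushes the job to the servers.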
As you see in that architecture diagram, we have the Nomad servers covering both regions. Each region is pretty much independent, and the two are federated across. So there is gossip between the 2 regions, but each is totally independent.
In this model, we have a 3-node server cluster in each region; one of the servers acts as the leader, and the remaining 2 are followers. It pretty much follows the consensus model. The moment a job is triggered to be scheduled, the Nomad servers reach consensus on which nodes it should go to.
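The arithmetic behind a 3-node server cluster can be sketched as follows. Nomad servers use Raft for consensus, which needs a majority of servers to agree; the function names here are illustrative.

```python
def quorum(servers: int) -> int:
    """Smallest majority of a server cluster of the given size."""
    return servers // 2 + 1


def failures_tolerated(servers: int) -> int:
    """How many servers can be lost while a majority remains."""
    return servers - quorum(servers)
```

With 3 servers the quorum is 2, so each region keeps scheduling through the loss of 1 server. Going to 4 servers does not help (it still tolerates only 1 failure), which is why odd cluster sizes are the usual choice.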
As the illustration shows, we have separate Nomad clients that are deployed across 2 regions, the US Central and US East. Then, based on the consensus among the servers, the job will be scheduled in any of the nodes. So the Nomad binary is running on each of these nodes, and it’s totally an immutable agent.
Another cool feature (we have it right now in stage, not in production) is that we have preemptible nodes among the VMs. We want to make sure we recycle them to cut costs, and then we adjust to any of those changes transparently to the running applications.
We have a similar setup for our private cloud.
» Behind the scenes of job scheduling
We talked about how we are managing the microservices in the Nomad environment. Once the job is scheduled, what happens behind the scenes? Once the job is scheduled, Consul has the service discovery for which service is running, because we have multiple services running on the nodes.
Vault provides the application secrets, so in this paradigm, any request that has to go through, we have an HAProxy that acts as a reverse proxy, and it gets metadata about the services from the Consul DNS. And the way it flows is, dynamically, it gets the details and then it knows how to route based on URI context.
To magnify what is within one of the Nomad client nodes: What we have could be any number of services running. It could be a Spring Boot application, or it could be a Golang Docker container application. It doesn’t matter at all. Any request from a user or any application request goes through the HAProxy, and then once it hits one of the nodes, the request is processed by the respective services. But, as we know, microservices are kind of a service mesh. I mean, each one has dependencies on calling other services. So we have Fabio, which we deployed as a Nomad system job; it routes the internal calls to other services. Those calls are able to route through Fabio. Fabio gets the same metadata about other services from Consul. So we are dependent on that and integrated with that.
It also handles changes. Let’s say there’s any rescheduling to different nodes. HAProxy gets updated, and it’s seamless and transparent while it’s running. We haven’t done any kind of config hard-coding or any such.
The developers like that, with Nomad, Vault provides the secrets. Applications can consume them dynamically at deployment time, and they get all the other metadata, the application key-values, from Consul. So it integrates seamlessly with our Nomad job, and that’s the whole flow we are managing right now.
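The routing idea described here, a proxy that builds its routes from service-discovery data and matches on URI context, can be sketched like this. It is a simplified model, not HAProxy's or Fabio's configuration: the catalog layout, the service names, and the addresses are all made up for illustration.

```python
def build_routes(catalog):
    """Turn a service catalog into a routing table, longest prefix first.

    catalog maps service name -> {"prefix": "/cart", "instances": ["host:port", ...]}.
    Services with no healthy instances are left out.
    """
    routes = [
        (svc["prefix"], name, list(svc["instances"]))
        for name, svc in catalog.items()
        if svc["instances"]
    ]
    routes.sort(key=lambda r: len(r[0]), reverse=True)
    return routes


def route(routes, path):
    """Return (service, instances) for the most specific matching prefix, or None."""
    for prefix, name, instances in routes:
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            return name, instances
    return None
```

When an allocation is rescheduled to another node, the catalog changes and `build_routes` is simply run again; nothing about node addresses is hard-coded in the proxy.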
But Fabio, right now, we haven’t deployed in our public cloud. It’s still in the PoC phase. We did try the other products like Linkerd, and after comparing the features, we came down to Fabio. But we haven’t deployed in the public cloud yet.
So, what is some of the feedback we got?
» Faster deployment
When we rolled out, speed was one of the constraints. We want to get applications out there quickly. Based on our experience with Chef, we learned that it does take time when we want to deploy any new version or any new application. Nomad cut that time a lot, because the moment you update a Nomad job and commit, the pipeline takes over and deploys in a matter of minutes. That was one of the key driving factors for developers to support Nomad.
» Blue-green deployment
We had managed blue-green deployment at the DC level, routing traffic over to a second set of nodes before bringing up a new version. Nomad helps us do it within the same DC, so we can get the newer version out and then slowly ramp down the older version, pretty transparently, with minimal downtime. That’s one of the key advantages we have.
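One way to picture ramping the newer version up while the older one ramps down is as a stepwise schedule of instance counts. This is only an illustrative sketch; the step count and the even split are assumptions, and it does not model how Nomad itself performs updates.

```python
def ramp_schedule(total_instances, steps):
    """Return (old_count, new_count) pairs that shift capacity in equal steps."""
    schedule = []
    for step in range(1, steps + 1):
        new = (total_instances * step) // steps
        schedule.append((total_instances - new, new))
    return schedule
```

For 8 instances over 4 steps this yields (6, 2), (4, 4), (2, 6), (0, 8): the total capacity stays constant while traffic moves to the new version.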
» Managing application properties
The Nomad job provides application templates, so the application can define its properties, and any of the dynamic properties can be defined as Consul key-values and pulled dynamically. That’s one of the key advantages: We don’t have to maintain separate repositories to manage application properties.
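As a rough illustration of rendering a properties file from key-values at deploy time: Nomad's `template` stanza uses Consul Template syntax, whereas this sketch substitutes Python's `string.Template` and a plain dict standing in for the Consul KV store. The key names and the `render` helper are hypothetical.

```python
from string import Template


def render(template_text, kv, prefix):
    """Fill $placeholders from the keys stored under one application's prefix."""
    values = {
        key[len(prefix):]: value
        for key, value in kv.items()
        if key.startswith(prefix)
    }
    # substitute() raises KeyError if the template needs a key that is missing,
    # so a bad deployment fails early instead of starting with blank properties.
    return Template(template_text).substitute(values)
```

Because the values live in the KV store rather than in the job file, changing a property means updating one key, not maintaining a separate properties repository.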
» Seamless CI/CD integration
We were able to leverage the existing pipeline, and then we started deployment, and it was pretty seamless, and we were able to bring it up to speed. And then the developers can take control of any newer version, test it quickly, and then iterate through. All these things go seamlessly and pretty fast.
» Functional UI
The functional UI gives developers more key insights: how the allocation went through, and the resource usage. Because one of the keys is they can define, in the Nomad job, the resource requirements: how much memory, how much CPU, and so forth. So they can look at the resource usage and then visually fine-tune the Nomad job.
» Ease of use
Nomad documentation is awesome. HashiCorp has done a good job maintaining the Nomad job description. Everything, each and every instance, is documented with examples. Ease of use is very good, and the config visibility.
One of the requirements is, once deployed, how to look at the configuration details quickly. The functional UI provides that, so you can look at: What are the configuration parameters we have taken care of? What is deployed? And what is out there? If there’s any change, they can quickly update the Nomad job and push the changes through the Git process, empowering them. So the configurability is another thing the developers like about the tool.
And so these are some of the pieces of feedback that really drove us to move forward and that led to operationalizing it in production.
» The Nomad numbers
Danny Parker: Let’s talk about some specific metrics and statistics of our Nomad deployment. To start, the picture: It ends, it looks like, on September 30, so it’s still missing 2 or 3 weeks of Git commits. That’s what that is, the number of Git commits to the repo.
You can see, in the beginning of March, that’s when we were doing a PoC, testing it out, and then around the June/July time frame we really opened it up for developers. This is for our Terraform repo, by the way, for the number of changes, and I’ll be very interested to see what that looks like long term.
To reiterate, we have Nomad in prod, on premises, and in cloud. We’re doing many deployments per day. I don’t have the exact numbers, because they change so frequently. Depending on if there’s a new feature or something breaks, you can deploy 50 times a day or once a day.
In addition to the app that my team is specifically using, we also have some other teams and some other things running in Nomad. One of the things that I think is pretty interesting is that we’re running a full ELK Stack (Elasticsearch, Logstash, Kibana) in Nomad. We’re ingesting around 300GB per day in that cluster, and so we were able to bring that up very quickly. Elastic provides really good documentation on the Docker images that they use for Elasticsearch. And so it’s given us a great opportunity to test out stateful apps in Nomad as well as stateless.
[As for] our clusters themselves, we do have 3 production ones: 1 in our data center, and then 1 in each region of the cloud. Right now, we’re only running about 8 agents, because we want to get that efficiency, the bin packing, and we want to share the allocations across the agents. I can tell you right now that that will be scaled up for peak, because it just went into production, limited use; it’s still a small amount of guest traffic. But that will be scaled up as we move closer to Black Friday and through the holidays.
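To show why bin packing lets a small number of agents carry many allocations, here is a simplified best-fit placement sketch. It is not Nomad's actual scoring algorithm; the `place` function, the capacity units (MHz and MB), and the tie-breaking are assumptions made for illustration.

```python
def place(allocations, agents):
    """Assign each (name, cpu, mem) allocation to an agent using best fit.

    agents maps agent name -> (total_cpu, total_mem). Returns name -> agent,
    or None for an allocation that no agent has room for.
    """
    free = {agent: list(capacity) for agent, capacity in agents.items()}
    placement = {}
    # Place the largest allocations first so they are not squeezed out later.
    for name, cpu, mem in sorted(allocations, key=lambda a: (a[1], a[2]), reverse=True):
        candidates = [a for a, (c, m) in free.items() if c >= cpu and m >= mem]
        if not candidates:
            placement[name] = None
            continue

        def leftover(agent):
            # Fraction of the agent left unused after this placement.
            total_cpu, total_mem = agents[agent]
            return (free[agent][0] - cpu) / total_cpu + (free[agent][1] - mem) / total_mem

        # Best fit: prefer the agent that ends up fullest.
        best = min(candidates, key=leftover)
        free[best][0] -= cpu
        free[best][1] -= mem
        placement[name] = best
    return placement
```

Packing allocations onto the fullest agent that still has room leaves other agents empty, which is what makes it possible to run on few agents now and add more for peak.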
» Faster deployment with Nomad
So, what did Nomad immediately provide the development teams at Target? The application example that I used, the one really important one, was that we went from a PoC to production to now scaling up before Black Friday, all within a couple of months. And that was really useful for us, and it was amazing to see. And it wasn’t just that application.
We now have an ELK cluster, we’re running some Telegraf, ingesting, and different things, all in Nomad. It’s given us the ability to do some PoCs and to test some things out very quickly and very easily.
The iteration factor was an immediate value-add for us in that before we were doing Spinnaker RPM deployments or Chef deployments, and we were unable to test some of these changes and test these new versions of APIs very quickly. And so moving to something that just allowed us to deploy within seconds was very useful.
And we now have multiple teams that are using our stuff. Some teams have seen how quickly our developers are able to use it and how much they enjoyed it, and they have asked to use it as well. And so we’ve offered it out to other teams, like, “Hey, you can use this to test your stuff, to very quickly get deployed, and to have a working example of your application.” And so that’s been really cool to see.
» Any future plans?
So, what are our plans for the future? What do we want to do with Nomad long term? What else do we want to look at?
We did want to look into auto-scaling. It sounds like that’s going to be included in Nomad at some point soon. That’s one of the things that I really want to look into. And not just auto-scaling the jobs, but auto-scaling our agents. Because you can auto-scale your job up really high, but if you don’t have the underlying agent capacity to support it, then it doesn’t help you.
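The point about agent capacity can be made concrete with a small calculation: given a target allocation count, how many agents are needed to hold it? The function below is a hypothetical sketch; the headroom percentage and the example sizes are assumptions, not figures from the talk.

```python
import math


def agents_needed(alloc_count, cpu_per_alloc, mem_per_alloc,
                  agent_cpu, agent_mem, headroom_pct=20):
    """Agents required to hold alloc_count allocations, keeping some headroom."""
    usable_cpu = agent_cpu * (100 - headroom_pct) // 100
    usable_mem = agent_mem * (100 - headroom_pct) // 100
    # An agent is full as soon as either CPU or memory runs out.
    per_agent = min(usable_cpu // cpu_per_alloc, usable_mem // mem_per_alloc)
    if per_agent < 1:
        raise ValueError("a single allocation does not fit on one agent")
    return math.ceil(alloc_count / per_agent)
```

Scaling a job from 40 to 100 allocations of 500 MHz and 1024 MB on agents with 4000 MHz and 16384 MB moves the requirement from 7 agents to 17, so scaling the job without scaling the agents would leave allocations unplaced.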
Also, inter-job dependencies. A lot of our apps that we deploy are microservices that are highly dependent on other microservices in the same Nomad cluster. I think it would be really cool to see some sort of job dependency-type thing where, if one job got upgraded to a certain version, then we can keep another job that only connects to that version, things like that. We have some weird version dependencies and stuff that it would be interesting to see.
More accurate resource utilization. Nomad’s pretty good at telling you what it’s using, but we need to get better at that. Teams still consistently request more than they need. We need to get better at figuring out ways to track that, figuring out ways to go back to those teams and say, “Hey, looks like you’re only using 1GB and you requested 2GB.” And I think I saw something in Nomad 0.9 that said that it would automatically tell you that. So that would be really interesting.
More ACLs. If you saw the chart in the keynote about the simple-but-less-secure versus the complex-very-secure, we’re probably closer to the simple-not-very-secure, in that we really just have 1 ACL right now. Everything’s supposed to go through Terraform, so people don’t have direct access, but we still want to get better.
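A report like the "you're only using 1GB and you requested 2GB" message could be produced by a check along these lines. The input format, the 50% threshold, and the job names are assumptions; where the usage numbers come from is left out of the sketch.

```python
def over_requested(jobs, threshold=0.5):
    """List jobs whose peak memory use is at or below a fraction of the request.

    jobs maps job name -> (requested_mb, peak_used_mb).
    """
    findings = []
    for name, (requested, used) in sorted(jobs.items()):
        if used <= requested * threshold:
            findings.append(f"{name}: using {used}MB of {requested}MB requested")
    return findings
```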
We’ll look into adding the Target data center as a region; I think that could be interesting to do.
And last, Consul Connect seems very promising. As Suresh mentioned, we’ve tested out Fabio as a service mesh, we’ve tested out Linkerd, we’ve even talked about putting HAProxy inside there. So using Consul Connect to manage that and secure that, maybe with Envoy, maybe with HAProxy, maybe something else, is something that I’m really interested in doing.
So that’s everything.
Thanks for coming. Thanks for missing the beginning of your lunch to see this talk. And feel free to reach out; we’ll be around. I’m here the rest of the week, so feel free to stop by and talk. Thank you.