Case Study

Elsevier's container framework with Nomad, Terraform, and Consul

Watch a live demo showing how Elsevier put HashiCorp Nomad into production to schedule their containerized workloads. The framework also uses Terraform and Consul, with additional enhancements written by the team.

The technology industry is adopting Docker at an ever-increasing rate. And with hundreds of complex web services and distributed systems, over 30 million users worldwide, 7,000 employees and more than 1,000 technologists spread across 12 countries, the desire to do so within Elsevier is no different.

With more development teams adopting microservices architectures, Elsevier needed to develop a container solution. Adopting a container platform, an inherently complex task, was made all the more difficult by the requirement that each development team be able to deploy, run, and maintain their own clusters with little to no core infrastructure support.

Faced with this challenge, the Elsevier Core Engineering team embarked on a 3-month project to design and build a framework that would meet the stringent set of requirements provided by development and operations teams. The team used HashiCorp Terraform, Consul and Nomad—with additional enhancements written by the team—to provide an operationally simple, scalable, fast, fault-tolerant framework that can be deployed to production standards in under four minutes.

This talk features a live demonstration of the platform, the new deployment features of Nomad 0.6 and the custom services written in Go by Elsevier that provide fully automated scaling of Nomad jobs and agents.



Eric Westfall: Hey guys, my name's Eric Westfall, this is James. We're gonna talk to you guys a little bit about Elsevier's work around productionizing Nomad for our container workload. It's gonna be a demo heavy talk, so we're gonna blast through these slides really quick. So hopefully you don't get whiplash as we go through these.

Like I said, my name's Eric Westfall. I'm the Principal Engineer at Elsevier's Core Engineering team. This is sort of a slide that we … we always like to call ourselves like Jobs and Woz, but everyone sort of calls us something else. So this is probably more accurate.

James: Hi, my name's James. Again, Principal Engineer, based in London. Eric's based in the US. So, we pretty much have done all of this through conference calls. Lots of conference calls.

And so, what do we do? What do Elsevier do? Elsevier is primarily a research company; we help researchers with a plethora of tools that we have. And in Elsevier we've got many hundreds of complex web services, applications, and distributed systems. We serve around 50 million users worldwide, and to do that we have 7,000 employees, which equates to around 1,000 engineers spread across 12 countries, working in a variety of roles: QA, development, and systems engineering.

And we're part of the Core Engineering team. Our responsibility is to build reusable frameworks and tooling that the entire enterprise can use: repeatable, commodity, consumable pieces.

Our requirements

Eric: Just a quick glance at some of our requirement-gathering process. When we were making these cluster-scheduler decisions, we had to weigh up all the different options between Kubernetes, Mesos, and Nomad. Before we do any large project like this, we always gather our requirements from our stakeholders and figure out what we're trying to achieve. Then we rack and stack the solutions to come out with a POC.

And naturally, Nomad came out on top after we put them through a fairly rigorous set of requirements testing. I was a big fan of Kubernetes so I was quite surprised that Nomad did as well as it did when it stacked up against the Elsevier requirements. So I'm just gonna talk about those a little bit:

1. Dev teams need to run and operate their own clusters

Like James mentioned, we have a bunch of different distributed teams. I think we said a little over 7,000 employees and about 1,000 technologists in Elsevier, across a bunch of different countries. Some of them are siloed into about 40 different development teams, all of them responsible for their own applications and services, and their own deployment models and operations as well.

So when we looked at this, we really had to say: the solution had to be operationally simple and deployable by those teams independently within their own constructs. We couldn't do a massive PaaS-layer solution. That was really key and one of the big differentiators between Nomad and Kubernetes for us at the time.

2. Fit into our existing platform and deployment methodologies

The other area: we wanted to fit into our existing platform and deployment methodologies. Meaning, Elsevier was very heavy on the HashiCorp stack before we picked Nomad; we were using Terraform, Packer, and all those things. We wanted to make sure that what we deployed was automated and reusable. So we took a modular Terraform approach, with the Terraform modules we're demoing today doing all of this deployment.

3. Must be flexible and allow for customization

The last thing was flexibility and customization. This is where Nomad really shines for us. The APIs and the extensibility of Nomad are pretty much endless. If it doesn't do something, you can figure out a way to do it. That was key for us and hopefully we'll show you some of the things that we've done to extend Nomad to meet our requirements.

Our solution

James: So what was the solution we came up with? As Eric said, we went for Nomad, and at this point I'm going to try and deploy the cluster. We don't have a cluster running; I'll try and prove that.

Eric: What we're gonna do today is a deployment of the cluster from scratch, all run through Terraform. Generally a cluster can go from nothing in AWS to a full production-ready cluster, automatically bootstrapped, in about four and a half minutes. That's usually what we find.

James: So I'm not lying, we don't have a cluster running. Our claim is that we can do a new cluster in four minutes. This includes a quorum of three nodes and two distinct worker pools. We won't really go into the distinction between worker pools during this talk, but we're happy to talk afterwards about the design we came up with recently; we think it works really well.

So as Eric said, we do everything in Terraform. Everything's a Terraform module, and this is no different. We've actually got three modules, but they're really two modules imported three times: a module for the servers and a module for the worker pools. So we're gonna create 186 things; let's hope this works. We'll let that run while we keep going through this. That works.
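As a rough illustration of that layout (module names, paths, and variables here are hypothetical, not Elsevier's actual modules), the root Terraform configuration might look something like:

```hcl
# Hypothetical sketch only: module sources, names, and variables are
# illustrative, not Elsevier's real Terraform code.

module "nomad_servers" {
  source       = "./modules/nomad-server" # server module, imported once
  server_count = 3                        # quorum of three nodes
}

module "worker_pool_private" {
  source    = "./modules/nomad-worker-pool" # worker-pool module, import 1
  pool_name = "private"
}

module "worker_pool_public" {
  source    = "./modules/nomad-worker-pool" # worker-pool module, import 2
  pool_name = "public"
}
```

Two modules, three instantiations: one set of servers and two distinct worker pools.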

Just out of interest, who uses Mesos here? I'm surprised. Mesos was really disappointing to us, really disappointing. I'm guessing most people use Kubernetes? Nomad? More than I expected, awesome.

Production-ready containerization

Our solution goal was to build a production-ready containerization platform that developers can pick up: they run the Terraform modules with a little bit of code and they just get a cluster. That was the end goal.

“Easy and fast to deploy a production-ready container cluster”

So as I said, deployed in four minutes (cross fingers), with minimal Puppet integration at runtime, just to do some runtime configuration. The majority of the time is Amazon boot time; we're working around three minutes for Amazon to boot instances. The same goes for the autoscaling, which we'll demo.

“Complex functionality, presented simply”

This was actually a quote from one of our architects at work; he was fairly impressed. The cluster has many features that we've added: advanced health monitoring, autoscaling, some pretty simple deployment pipeline examples for our development teams to use, and zero-downtime rolling upgrades of all the cluster binaries.

You can do it in an immutable way, using an AMI and rolling the AMI through the cluster. We also offer a way where teams can just run a steady update across the binaries while we manage workloads and make sure there's still quorum. And there's a full backup-and-restore DR strategy; we can't have a production system without that, so we had to include it.

“Highly resilient, stable and fault tolerant”

And highly resilient: Consul and Nomad take care of that pretty well by themselves. During negative testing, when we got our MVP up and running, it was very hard to break, and that was one of the good reasons why we picked it. Even though we only run it in a single region, if we lose that entire region, the way we back up and the way we've got everything in code mean we can bring it up in a new region in a matter of minutes. Downtime's pretty easy to handle.

But of course, Nomad isn't the be-all and end-all; there are a lot of subsidiary pieces. There's Puppet; Amazon is our primary cloud provider; there are the Terraform modules and some custom Go daemons that we'll demo. And Fabio: if you don't use it or know it, Fabio's great. It's the default that we use out of the box, but that doesn't mean you can't swap it for Traefik or Envoy if you want something slightly different.


Eric: Now we're going to go check on the status of the cluster build.

If any of you were in the earlier talk today about Nomad, we're going to demo some of the new 0.6 features that were highlighted, like the ability to do canary deployments, rolling updates, and blue/green, as well as a pretty heavy focus on something we're calling Replicator, a Go daemon that we've written and open sourced that provides cluster and job autoscaling in a single daemon for Nomad.

So we're gonna cross our fingers and hope this all actually worked the way it should've. OK, it looks like the cluster … the Terraform run finished and we will check and see if we have a Nomad leader. And we do.

James is just going to do a little bit of setup here. All of this is running in Elsevier's private AWS VPC, so none of it's publicly accessible; we'll set up an SSH tunnel to give ourselves access into the Nomad cluster itself.

James: The first thing we're gonna do is deploy Fabio. We want to come up with a way to auto-load jobs into Nomad (it's a bit annoying without one), so for now we're just gonna deploy Fabio. We deploy Fabio as a system job, which means Nomad will manage running this job on every node in the cluster that meets the requirements.

It's a really nice feature, especially when we scale out the cluster: we don't have to rerun the job to get it onto the new nodes. Nomad will manage running that job on the new nodes in the cluster for us. It's really nice.
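A minimal sketch of what a Fabio system job could look like (the image, network mode, and resource figures are illustrative assumptions, not Elsevier's actual job file):

```hcl
# Illustrative Nomad system job for Fabio; values are assumptions.
job "fabio" {
  datacenters = ["dc1"]
  type        = "system" # Nomad runs one instance on every eligible node

  group "fabio" {
    task "fabio" {
      driver = "docker"

      config {
        image        = "fabiolb/fabio"
        network_mode = "host" # let Fabio bind its routing and UI ports directly
      }

      resources {
        cpu    = 200 # MHz
        memory = 128 # MB
      }
    }
  }
}
```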

Eric: So as you can see, we use Jenkins to orchestrate the whole CI/CD pipeline for Nomad. Our developers are already used to using Jenkins, so we thought it was pretty important not to switch things out for them. We abstract away all of the Nomad API stuff and give them the same workflow they're used to; it just happens to be a ton faster now.

Anybody else using Fabio today with Nomad as a routing layer? Yeah, it's fantastic; everybody seems to love it. If you're not familiar with it, definitely check it out. Zero-config routing is awesome.

We can see that Fabio's successfully deployed on the cluster. Like we said, we run it as a system job, so we've got four nodes across two distinct worker pools, and Fabio's running on all four of them.

So now we're gonna move on from this and deploy some applications. Throughout this demo we're gonna deploy something we call ceapp. ceapp is just a little Go web daemon that we've written: a web service that starts an HTTP server and responds to a version endpoint, spitting out its version.

The first thing we're gonna do with this app is a canary deployment. If you aren't familiar, this stuff came out in Nomad 0.6; it's a fantastic way to control your destiny when you're updating versions. You can run a canary deployment, decide if it's working well, and once you're comfortable with it, promote it. So I'm just going to run a quick curl here that hits that web service's version endpoint.

As you can see, we're spitting out version 1.0.0 over and over again. Then we'll take this deployment and bump it to 2. You should see zero interruption to that, and it should be very seamless.

As you're probably familiar with from the talk earlier about running at scale, you can control these update stanzas: how many updates are allowed to happen at once, what the stagger interval is. This lets us run in a very production-safe way too.
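For reference, a Nomad 0.6-style update stanza for a canary rollout might look like this (the values are illustrative, not the demo's exact settings):

```hcl
# Illustrative update stanza; all values are assumptions.
update {
  max_parallel     = 1     # how many allocations may update at once
  canary           = 1     # place one canary before touching the rest
  min_healthy_time = "30s" # how long an alloc must stay healthy to count
  healthy_deadline = "5m"  # give up if not healthy within this window
  stagger          = "30s" # delay between update batches
}
```

With `canary = 1`, bumping the job's version places a single new canary alongside the old allocations; promoting the deployment then rolls the rest forward under the same rules.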

I'm just gonna update this, and it should deploy the 2.0.0 canary. As you can see, we've got one canary out there now. It hasn't touched the version 1.0.0 endpoints, but you'll probably start seeing a little bit of 2 flutter in there. This is all happening dynamically with Fabio.

So Fabio's picked up this new canary and will automatically route a very small percentage of the traffic to it, and we can watch that. Operators can look at the logs and make sure it's healthy; then we'll promote it once we're ready.

James: Yeah, these were really nice features when 0.6 came out. We were doing a lot of work with our internal development teams, and as soon as it came out we were putting this into our internal demos. It was a big feature that development teams stood back from and went, "OK, that is really nice. That is really helpful."

Eric: So we're just gonna do a manual promotion of this deployment, and it will follow the same rules: whatever's in our update stanza, the stagger and things like that, will be observed. Nomad will automatically deploy all the rest up to 2.

James: And we're not dropping any traffic, which is exactly what development teams want. Now I'm just gonna purge this.

Eric: You shouldn't be doing this very often in production, but for demo purposes we're gonna purge the job out of the history so we can do a blue/green deployment now.

This is gonna be a traditional blue/green where we essentially manipulate the canary count. If you have three production hosts running version 1 and you wanna do a traditional blue/green deployment, you deploy three canaries so that it's even on both sides.
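In update-stanza terms, blue/green is just setting the canary count equal to the group count (the numbers here are illustrative):

```hcl
# Illustrative blue/green setup: the canary count matches the group count,
# so a full parallel "green" set is stood up before promotion.
group "ceapp" {
  count = 3

  update {
    max_parallel = 3
    canary       = 3 # equal to count: a full green fleet alongside blue
  }
}
```

Promoting the deployment swaps traffic to the green set; failing it tears the green set down and leaves blue untouched.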

James: Yeah. Our most common pattern is rolling deploy. But having these available for maybe some of the more advanced development teams is really helpful.

Eric: It's highly dependent on how robust a service's health checks are, really, more than anything. If you can trust that the application's testing is good and you can have confidence in the health check, rolling's where it's at. But a lot of teams really prefer controlling the …

James: And now we're going to go backwards from 2 to 1, 'cause you do that a lot in production. Particularly we do. I forgot to change my Atom setup. So, if that deployment is finished … yeah, that's cool.

Eric: Again, same construct here though: we're on a canary, and we're gonna manually promote this.

The other thing that our teams have just been blown away with is the speed of Nomad. If you do this stuff on traditional EC2, a blue/green deployment might take you 15 or 20 minutes. Even in a normal small immutable ASG, it can be up to five to 10 minutes. And rolling back is terrible.

In these scenarios, it's so fast that they don't really hesitate to do deployments now.

James: Yeah. And you'll see now we've got the output of 2 and 1. Really nice. Nothing's going wrong apart from my shoddy typing. But this one we're just gonna fail; let's say we've found a problem. It's really nice to be able to do that.

Eric: Yup. So again, we're gonna pick this deployment and set it as failed. This mimics a case where we've gone ahead and deployed without testing properly and now it's all gone south.

And we're gonna revert back to the previous version, which in this case will actually be 2.0.0. That's the opposite of what you'd normally do, but yeah.

James: And yeah, Nomad handles that perfectly. It stops the blue/green deployment, increments the version onto the healthy, stable version, and just leaves that in place. We keep an eye on it, and once one's dropped out, we're back to our original version 2.

And it's really nice; it gives us a lot of flexibility. A lot of what we do is, as I said, this generic kind of setup, with flexibility for developers to then add and pick what they choose, which helps them.

Eric: Okay. So we've got about 20 minutes left, and the rest of this is gonna focus on some of the scaling features of Replicator. Replicator is a daemon written in Go. It can run as a Docker container job, as an exec job directly on the Nomad host, or, if you want, as a system binary on the server nodes or wherever you like.

The first thing we're gonna do is deploy Replicator via the Jenkins CI pipeline. Again, this is one of those areas where we're looking at this whole auto-load business: there are certain things you always want running on a Nomad cluster when it comes up. It's something we're still looking at; it's just figuring out a workflow that gets all this stuff bootstrapped on a cluster from the beginning.

James: We use a small helper tool for this. It does some templating and wraps the deployment API with a watcher, so developers can get a bit more feedback on deployments. If you just use the Nomad binary, you don't really get much feedback. So this has been pretty helpful.

Eric: And that'll be open sourced as well pretty soon.

James: It is already open sourced.

Eric: It is, yeah, okay. So that's a project we're calling Levant, and it's on James's GitHub. It's really nice for doing variable templating and monitoring deployments as well.

James: We've got Replicator running. I'm just gonna grab it from here and we can have a look. You can run as many copies of Replicator as you like, but it does leader locking. So you can have 20 of them running; they do a spin lock, one grabs leadership, and the one that's got leadership is the only one allowed to do any enforcing. The others will just sit there idle.

They keep trying to get the lock, and if one can, it takes over leadership.

Eric: Yeah. So we're gonna deploy a copy of ceapp. This time we're going to deploy a very large number of these just to try to force the worker pool that's running it to run out of space.

What we're gonna do is watch Replicator do its job, which is to detect that this worker pool is out of space. And it's going to automatically increase the size of that worker pool dynamically based on the load that we're pushing onto the cluster.

This is something that, at the moment, you can't do natively: automatically respond to dynamic workload changes and meet that need. The reason we address this separately from, say, putting it in an auto scaling group and just scaling on CPU or memory is: how do you decide in advance what's safe, and how your workload is going to dictate scaling?

We have a lot of different tunable algorithms inside Replicator; it does all the work for you. Essentially, you don't have to guess in advance and say, "if the cluster hits 80% CPU, scale out or scale in." It figures it out based on whatever workload's present, automatically.

As we push this job, it's going to strain the cluster and Replicator should detect that and try to scale it. So as you can see, we've already placed 10 of these. This is a very small worker pool running on T2 micros, so they don't have a lot of space. So just pushing 12 of these things should stress the cluster enough.

Here we can see that Replicator's automatically detected that the worker pool requires a scaling operation. In this case it's saying that the direction is a scale out. It currently has two nodes. We can get into a lot of the details later, but the algorithm that it's using right now prioritizes the most constrained scaling metric on the worker pool and calculates capacity.

It allows you to say things like: on a production worker pool, I should always be able to lose one of these nodes without any impact to my workload. It takes that into consideration when it's making these decisions, as well as cluster scaling overhead for the jobs.

Several people have asked us why we do job scaling and cluster scaling in one project. Having them in the same project allows us to make a lot of decisions that we couldn't otherwise: one of those is to look at any job Replicator sees that's scalable, that has a Replicator scaling policy on it, and consider that during the cluster scaling calculation.

If Replicator might scale those jobs out by one based on workload, it takes that as scaling overhead and reserves that threshold as well. So it gives you a lot of intelligence on what might happen in the future and allows you to consider that when scaling.

James: The node policy is always running as a watcher, and the job scaling policies run on watchers too, so they always get the latest information when there's an update. It means that if you deploy new jobs and they're all set to scale, the cluster scaling part of Replicator is automatically and instantly aware of that.

So we're always balancing the cluster size. It also means that if you want to run really lean in dev or staging, you can set your node fault tolerance to zero. You're just not gonna have any fault tolerance, which is fine in some environments.

Eric: This is a good point is you can see in the output here, we have two worker pools running. Like we said, ones meant for private workload, one's for public workload. They just have different security groups around them.

But these are all running in multiple goroutines (sort of like threads, if you're not familiar with Go). It's all concurrent: these watchers are always running, and it's incredibly fast, even here where something might take a while because we're waiting on AWS to do work.

What we're gonna do here is increment the auto scaling group, then interact with the API to figure out the most recently launched instance, which we did. Then we use a polling mechanism to wait for that node to successfully join the Nomad cluster itself.

All this happens automatically. If Replicator detects a failure during one of these steps (it didn't in this case; the node successfully joined the worker pool), it takes the appropriate action: it terminates that instance and keeps retrying the scale-out for you automatically, up to a maximum retry count.

If it hits that max retry, it can set that node aside for you to debug and notify via whichever configured channels we support, things like PagerDuty and generic SMS. We're going to add support for more, things like Datadog, as well.

James: Yeah, we have a concept of failsafe. It works with both cluster and job scaling. If we try to scale either the jobs or, in this case, the cluster, and we fail three times (or however many you've configured), Replicator puts that worker pool into failsafe mode and ignores the scaling it wants to do for it.

The idea being, we don't want to keep trying to scale, bringing nodes in and out, and make that worker pool even worse than it already is. So we put it in failsafe and alert an operator, and an operator actually has to remove that lock through a CLI command. It's designed to look after the cluster as well as scale it.

Eric: The next thing we're gonna show you is job scaling, and we're gonna use this as a two-part thing. This next copy of ceapp will have a job scaling policy included in the Nomad job file.

If you remember, we deployed 12 copies of this, and we certainly don't need 12 copies; the workload isn't using them. Replicator is gonna scale the job in until it hits its minimum threshold, which we've defined. Then, as a result of that, Replicator will detect that we're way over-allocated on that worker pool and scale in the cluster as well.

So again, this shows that jobs can scale themselves based on load. In this case: scale in if memory is 30% or lower and CPU is 30% or lower, which it will be, 'cause we're not doing anything with it. So naturally these jobs will scale in to their minimum, until Replicator isn't allowed to go any further.

All of the config parameters for cluster scaling as well as job scaling live right in the Nomad HCL file for the job, as meta parameters. So you don't have to have an external config or do anything weird like that; it lives with the job config.
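As a sketch of the idea (the meta key names below are hypothetical; check the Replicator documentation for the exact keys it reads):

```hcl
# Hypothetical Replicator scaling policy expressed as Nomad meta
# parameters; key names and values are illustrative only.
group "ceapp" {
  count = 12

  meta {
    replicator_enabled     = "true"
    replicator_min         = "2"  # never scale in below two allocations
    replicator_max         = "12" # never scale out beyond twelve
    replicator_scalein_cpu = "30" # scale in when CPU usage is <= 30%
    replicator_scalein_mem = "30" # scale in when memory usage is <= 30%
  }
}
```

Because the policy travels with the job spec, Replicator's watchers pick it up, update it, or clean it up as the job itself changes.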

So we'll deploy this and watch Replicator almost immediately start taking action. The job scaling is much quicker, 'cause Nomad's much quicker than AWS's API, so you'll see it start cutting the ceapp job count down in increments almost as soon as this deployment finishes.

James: Like I was saying, the job policy watcher picks it up straight away. It manages the complete lifecycle of job policies. So if you update one, it will pick it up.

If you deploy a job, it will have a look at it, see if there's a change, and if there is, update it. If you stop a job, it cleans it up. If you remove a group from a job, we handle that too with an orphan-group check. These are pretty fluid environments; we try not to leave anything behind, because orphaned pieces in our infrastructure are less than ideal.

That'll take a few minutes to deploy, just because we're doing a staggered update on a job that's already running. But it should only take about 10 more seconds.

What will happen is, Replicator will start analyzing that job, see that it's within its configured thresholds, and scale the job. At the moment the step is hard-coded to an increment or decrement of one, but we're open to adding configurable steps.

It's kind of like Amazon auto scaling: "if I reach my CPU maximum, I want to add two, 'cause I want to be safe." At the moment the scaling happens so quickly (we run the daemon every 10 seconds) that adding one or taking one out is quick enough to handle any burst in load.

Eric: As you can see, it's already taking scaling action on the ceapp job, and it will continue to do so until it hits the minimum. We have this concept of safety gates: if Replicator detects that a scaling operation is needed, it will keep trying, and if it's not allowed, the safety gates will block it.

There are a billion different things we do for safety checks; we can talk about that later if you're interested. But essentially, this will just keep trying to initiate scaling operations until it's told it can't anymore. In this case, the minimum threshold I think we set was two, so once we're done here we'll look at the Nomad status of the job and it'll be two instead of 12.

At that point, the cluster scaling will detect that we're way over-provisioned and take action. Now, there are things like cooldown periods and other protections on top of that.

Replicator isn't allowed to just immediately start scaling the cluster if it's only been two minutes since the last operation. All of that's tunable, and there are tickets open to add other protections around AWS instance launch time, AZ balancing, and things like that as well.

While we're waiting for this to finish: we're getting ready to merge a PR that will turn this into a provider model, so all of the cloud-provider-specific cluster scaling will be done in a similar way to how Terraform's provider model works.

We'll be able to plug in things like Google Cloud and Azure, as well as some more generic scaling backends, like running Terraform instead of interacting with the cloud API directly.

James: And while we're doing the scaling in the background, the Consul health checks and Fabio are just doing their job. I can quickly jump onto the Fabio UI.

Eric: So as each one of these decrements happens, we should see Fabio automatically drop them off the route table almost instantaneously and reweight the routes.

James: Fabio is a brilliant piece of software. It works so quickly out of the box.

Eric: So I think our cooldown period for the demo is probably still about five minutes, so it won't be allowed to scale in any further until then.

James: This will scale in to two, eventually. During the previous deployments we've not lost any traffic; Fabio, Consul, everything we're doing is keeping the cluster steady and stable, doing its job.

Eric: And the other thing you probably noticed while the original Jenkins deployment was running: one of Replicator's safety gates is whether the job is in deployment mode. Here you can see it's already reacted to the cluster over-allocation, so now the direction is in instead of out; we're gonna scale down from three nodes in this worker pool to two.

When we're scaling in, there's some different decision logic. The cloud provider works in tandem with the Nomad API to figure out the best node to terminate. That logic is open to change in the future, but at the moment we take the least-allocated node from the API's standpoint.

Like I mentioned, the algorithm does a bunch of stuff we won't get into in detail, but it takes a prioritized scaling metric: what is the most constrained resource on the cluster? Is it CPU, disk, network IO, memory?

Whatever it is, it figures out which of those metrics is the most tightly constrained and uses that as the priority. Then it looks at all the nodes that have workload on them and figures out which one is least allocated. That's the most eligible node for termination, so it floats to the top and that's the one we pick.

We then put that node in drain mode in Nomad and wait for the allocations to move to other nodes, if there are any. That'll poll and wait for the drain to finish. Once the node has no active allocations, which in this case we see, the cloud provider terminates it by calling out to the scaling API, and we confirm the instance gets terminated successfully.

That can take several minutes, 'cause the ASG API is a little slow to respond on occasion. So you'll see this do basically nothing while it sits and waits for Amazon to respond to our request. Usually it takes two to three minutes for Amazon to catch up and do what we asked.

What I was saying earlier is that one of the safety gates from the job scaling perspective, especially in Nomad 0.6 with the concept of deployments, is whether a job is in active deployment mode. If, for example, you roll out some canaries and you haven't promoted them, Replicator won't take any action on that job. So it doesn't pull the rug out from underneath you and start making changes while you're testing a deployment. It's nice.

So as we can see, the watchers running in the background, constantly polling the Nomad API, see the change and have now dropped the node discovery count to two for both worker pools. We're back where we started.

That demo took us from two nodes; we put on some workload and the cluster scaled up automatically based on it, then we scaled the workload down dynamically and the cluster scaled back in to where we were before we started.

James: Now we're just gonna do a rolling deploy of the scaling ceapp to show the final few features of Nomad. We're just gonna do two. This will be really quick because we've only got two instances of it running.

Eric: This is the traditional deployment that you're probably already familiar with: the rolling automatic deployment. After that we'll look at the auto-revert.

James: This will take a few moments. We've got things like min healthy time, healthy deadline, stagger, and auto-revert in there. This should be fine: version 2 does exist. And we'll see it's already done. The job's still listed as deploying just because the deployment hasn't been marked as finished; it's still doing its final health checks. And there, it just finished.

Eric: Which I think is a good point: for production workloads, it's really important to use some of those new features in the update stanza. The clauses controlling your healthy limits and things like that are very important, so that you can be confident.

What we're gonna do now is deploy a version of this app that doesn't exist at all. So this should fail miserably, and since auto-revert is enabled, it should try, fail, go back to 2, and not really take any outage at all.
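An update stanza with auto-revert switched on might look something like this (values are illustrative, not the demo's exact configuration):

```hcl
# Illustrative update stanza with auto-revert; values are assumptions.
update {
  max_parallel     = 1
  min_healthy_time = "10s"
  healthy_deadline = "3m"
  auto_revert      = true # roll back to the last stable version on failure
}
```

If the new allocations never pass their health checks within the deadline, Nomad marks the deployment failed and redeploys the version it last recorded as stable.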

James: It's the most reliable way to break this. Obviously in production, you'd deploy something and maybe it doesn't pass its health threshold for some unknown reason. This version 3 just doesn't exist, so it should bomb out pretty quickly. And there we go: job failure. If we just go back and look at the cluster … let's get rid of that.

Eric: This is where job versioning really comes into play. What you'll see here is that we actually did bump the version; we're now on version 14 of this job and we started at 0. That auto-revert automatically bumped the version again for us.

James: You can see that it failed because the allocation was unhealthy, and it rolled back to the previous known-good state. We shouldn't have lost any traffic, and we're still running version 2.

Eric: Yup. So that's all we've got for you, I think. We have open sourced Replicator and we're definitely open to community feedback; we'd love to hear what you think about it. We're gonna try to work with some people and get more support out there for different cloud providers, and definitely some other scaling backends as well.

James: Yup. But yeah, any feedback would be great.

Eric: Thank you very much.

James: Yeah, thanks.
