Learn how PagerDuty plans for and upgrades Nomad, Consul, and Vault with no downtime to their customers or engineering.
Welcome to my talk here. We're going to talk about the incredibly exciting world of software updates. Woo! A little bit more serious. Two of PagerDuty's core values are Ack + Own and Take the Lead. This is me doing so because knowledge sharing is good, and upgrades are lonely if you do it by yourself. I'm going to try to keep this light and fun because this talk could be really boring.
With a bit of an overview, let's talk first about me, then about some corporate shilling for PagerDuty, and then we'll go into our tech stack. Then, we'll talk about planning, groundwork, and executing a successful upgrade. After I've talked your ears off about all of that, I'll go into some specific lessons learned about our upgrades of Consul, Nomad... Oh, Vault... This is not good. Sorry, I meant these three.
Hi, I'm Todd. I've got a logo. I work on the site reliability engineering team at PagerDuty in the infrastructure org. I'm an amateur blacksmith and metalworker. I live in Vermont.
You probably know PagerDuty. We wake you up at night, and we're good at waking you up at night. But honestly, we do a lot more than that. We pioneered the modern Operations Cloud through tools like process automation and our integrations with Salesforce and AI tools. We want to help you do more with your data, map it back to business use cases, and get you back to bed faster.
Some facts about the PagerDuty infrastructure. In August, when I pulled these, we had about 550 Terraform applies in our production environment. We had 230 deployed microservices and approximately 2,500 releases to production in a month. That's 85 deploys a day to either of our production environments. This was backed by about 700 EC2 instances.
All of this happens with zero maintenance windows. You will notice PagerDuty never advertises maintenance windows, and we build our software to support that. All of this, and so much more, is supported by fewer than 30 SREs across four teams.
Let's talk a little bit about our HashiStack. The primary point is not actually to talk about our HashiStack; it's to talk about how we update it. For simplicity: we use Nomad for containers. We use Consul for service discovery and key-value pairs. Woo! And Vault for secrets, who would've known? We use Chef to apply our opinionated configuration, and we run on top of AWS.
For more context, the team I'm on, Cloud Infrastructure, is responsible for everything on that board. There are other SRE teams that have responsibilities that work with it, but we are primarily responsible for maintaining our current HashiStack.
We are also responsible for updating our HashiStack. And, over the last two years since I've been on the team, we've spent a ton of time and energy doing so. We have conducted numerous upgrades, both for best practices and for FedRAMP Authority To Operate, which requires us to be on well-supported versions of any software we're using. This has included three updates to Nomad, and required three updates to Consul completed in six weeks, as well as a complete architecture migration, ownership transfer, and backend overhaul of Vault.
In the pursuit of Authority To Operate in the federal marketplace, we have instilled significantly more discipline in how we do software updates. This was on top of our numerous OS, Chef, and other system upgrades and major products that we needed to do both for FedRAMP and for best practice.
Let's talk about how we do updates. We think about updates in three phases. Phase one is planning and prepping. Phase two is laying the groundwork: taking the work from phase one and making it actionable. Then, finally, phase three is execution, where you do the thing.
It's important to say upfront that upgrades used to be something you would do — or I would do — as an individual contributor. But with how complicated systems are, they need to be something your team does together. While you may be responsible for leading an upgrade — the primary engineer doing multiple phases of an upgrade — a successful and boring upgrade involves your whole team.
In step one, we write things down. We gather notes and information. We want to focus on gathering the shape, scope, and intent of an upgrade, as well as identifying the impact and how people may be impacted by that work.
When we start, there are two places we look: previous artifacts and vendor documentation. Previous artifacts are things in our shared knowledge space — Wiki, Slack, people's brains, that kind of thing. The second is vendor documentation, release notes, product updates, that kind of thing. They often have a bunch of stuff, but if you've ever been in the Hashi documentation, you know it can be buried, and you know it can be a little inconsistent.
This often prompts an email to our account team saying can we clarify? It says here two versions, but you say four versions. Which one is it? That often helps clarify and direct us in the right way to upgrade things.
We gather knowledge around an upgrade, and we seek to answer known knowns and known unknowns. For example, if a feature is being deprecated, I'm not going to know that, but the vendor release notes will. On the other hand, the vendor release notes aren't going to know that if I don't do Foo steps, I'm going to break database Bar. In our case, it's consul-template, HAProxy, and our load balancers.
So, this is our chance to step through these things to minimize issues. Your account team, your favorite community — Reddit, Discord, who knows — or both can help you answer unknown knowns. Or at least give you a good panic attack about bugs, things that don't work, weird or novel interactions, things working wrong, or my favorite — things breaking after they work. However, we can't control those. Those go squarely in the unknown unknowns. We want to work to mitigate them. More about this later.
A second way of thinking about this is taking things out of the column of things you don't control and moving them into the column of things you do control. You likely know how your Docker containers are deployed, how their networking works, and other features, but you may be less clear on edge cases. And you're likely unaware of potential bugs in something like HCL.
I think it's important to recognize the people in this chart. While we personally haven't perfected mind control, we can set expectations and make sure people stay well-informed during an upgrade.
We use what we call a one-pager. The one-pager allows us to communicate intent, responsibility, and known knowns and known unknowns — the things we talked about earlier. Equally important, it lets us define what success looks like, provides visibility outside and clarity inside our team. This also communicates likely timelines, as well as the expectations of the work needed. That is to say, things like epics or phases.
It's not important to be a hundred percent correct here. We aren't robots. Yet. It's important to give an educated guess without spinning your wheels for too long. Another key difference is we think about upgrades as a project or a major milestone in our year, not as a task that somebody does. There may be 10, 20, 30 tickets per upgrade to make sure all the work gets done in the way that we think it needs to get done.
It's important that you're not writing your whole plan down here. You're gathering information to eventually write a plan. But gathering this data can turn into a vicious cycle of getting up into your own head and overthinking things. Believe me, I know. The overall goal here is to spend enough time to be diligent in your discovery but move on to the actual laying a successful groundwork as quickly as you can. If you get further along and find you missed something, you got the tools to fix it.
We've got a bunch of unsorted Legos. We have an understanding of the project. We have notes, we have knowledge, we have intent, we have what success looks like. So that's a good start, but we also don't have them necessarily grouped. We want to move on to the groundwork phase to get them sorted and prepared.
When we lay the groundwork, the point is to understand the individual steps we need to take and sequence them together. Often, it works best when you put them in groups: prep tasks, execution, validation, meta concerns, and rollback and restore procedures.
You're basically connecting different-shaped upgrade Legos into a sequence that lets you be successful. After a few go-arounds, the groundwork phase is likely going to turn into the shortest phase of any upgrade project because software updates generally don't change that much. There may be a new step jumping between versions, some deprecation, but honestly, upgrading Consul doesn't change that much. Upgrading Nomad doesn't change that much.
The flip side is it's tempting once you've done it a couple of times to say we have a plan. Let's go do it. You're taking this time to actually test your plan and make sure it works. All upgrades take time and energy. Your HashiStack is likely a tier zero or tier one service in your company. You want the upgrade to be straightforward and boring so that if things go wrong, you can jump in and bring your big engineering brain to solve the problem.
Everything else in an upgrade should be on autopilot. A well-prepared upgrade means you've done all the hard work ahead of time, and it should feel like a non-event — a let-down — after all the work you've put into it. While you can't account for everything, you can account for the most common things.
This is a way of saying you want brain space during an upgrade for the "I didn't expect that" moments. You write all this down so that if something goes wrong, you aren't hunting for KB articles, commands, validation steps, etc. Or if you are, you're looking for one thing rather than ten. And you're figuring out how to sequence that one thing into your existing steps rather than trying to come up with a whole new procedure on the fly. The question is, how do you do all of this?
This overall step is your chance to validate, verify, and check that you can roll things back, and to write documentation. For example, this is your chance to check that all of your commands play nicely together, or that the ones that don't can be made to line up in known ways.
For example, if you take an EC2 list of Nomad names and try to make them match the internal Nomad IDs, this is your chance to make sure your commands make them sync up so you can run fleet-wide drain commands. This is also your chance to understand where and when you can validate and verify your system upgrades and talk about where and when you can cleanly roll back an upgrade.
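That matching step can be sketched in shell. This is a toy example under assumed data: the two files stand in for sorted output from `aws ec2 describe-instances` and `nomad node status`, and every hostname and ID here is made up.

```shell
# Hypothetical inventories keyed and sorted by hostname: one from EC2,
# one from Nomad. In practice these come from the AWS and Nomad CLIs.
cat > ec2_hosts.txt <<'EOF'
nomad-client-01 i-0aaa111
nomad-client-02 i-0bbb222
EOF
cat > nomad_nodes.txt <<'EOF'
nomad-client-01 4d2ba53b
nomad-client-02 9f81c3aa
EOF

# Join the two lists on hostname so each EC2 instance ID lines up with its
# Nomad node ID, ready to feed into fleet-wide drain commands.
join ec2_hosts.txt nomad_nodes.txt
```

Verifying ahead of time that this join actually lines up — hostname formats match, both lists are sorted the same way — is exactly the kind of thing the groundwork phase is for.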
Finally, by doing this, you get free documentation. Every manager is after individual contributors like me to write more documentation. You get this for free while you're doing the work, and you get great runbooks.
Take it from me. Make sure that your backup and logging commands work. In my personal case, I was using a clever jq command to change the logging level in Consul during an upgrade, but the command didn't save the consul.hcl file. Instead, it just output what the Consul config file would have been. In my case, the upgrade went fine, but after the fact, another engineer — thanks, Dmitri — said, I don't think that works the way you think it works. He was very right, and I was very lucky.
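The general lesson — prove your edit actually landed on disk — can be sketched like this. The original mishap involved jq; this sketch swaps in sed so it runs anywhere, and the file name and log level are purely illustrative.

```shell
# A toy stand-in for a Consul config file (illustrative, not real config).
cat > consul.json <<'EOF'
{"log_level": "INFO"}
EOF

# BROKEN: this only prints the modified config to stdout; the file on disk
# is untouched. This is the shape of the original jq mistake.
sed 's/"INFO"/"DEBUG"/' consul.json > /dev/null

# FIXED: write to a temp file, then move it into place.
sed 's/"INFO"/"DEBUG"/' consul.json > consul.json.tmp && mv consul.json.tmp consul.json

# Prove the edit actually landed on disk before relying on it.
grep -q '"DEBUG"' consul.json && echo "log level updated on disk"
```

The last line is the part that matters: during groundwork, bake a check like that into the runbook so nobody has to take the command's word for it mid-upgrade.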
Things that are going to ruin your day: Pagey is going to ruin your day. Bosses are going to ruin your day. Pagey is going to ruin your day with an errant page in a public space or people bugging you about an upgrade in a public space. Remember to include in your plan how to mute potentially affected monitoring while still having a place to check that your systems work. You don't want or need Pagey blowing up everything. But you do need to make sure that you haven't actually blown up everything.
Additionally, if you use Slack or Teams, tell people where they can follow along with an upgrade so you don't get pinged by ten people asking if a weird blip was your fault. Or even worse, thinking that your upgrade is causing issues and stopping to investigate nothing. Or your boss pinging you for status updates every 15 minutes. If you set expectations right, they know, along with the rest of the company, where to go to look for statuses.
We brought this slide back. In terms of what we're doing, we're moving things from the right to the left. We're trying to get as much control over things as possible. While you can't fully control users, bugs, or edge cases, you can at least move them closer to your control, because you have everything else completely under control.
You're going to have some sweet upgrade Legos. You're going to have some green ones, some yellow ones, some blue ones. It's going to be great. You're going to have a skeleton of assembled Lego blocks. This plan should have all the component steps: prep, communication, validation, etc. But it's not going to have things like the final server list. It's not going to have any PRs that you need to prep. It's going to have spaces where you're going to put those in. Those get filled out in the execution step.
I worked with an Englishman who says it does what it says on the tin. The execution step is where you do the execution. First things first: you take your plan. You add the specific details we just mentioned — hostnames, prepped PRs, blah, blah, blah. You communicate the date and the change.
If you work in an ITIL shop, you've already done all the work. You've written a plan, written rollback, written communications plans. But regardless of whatever kind of control system or lack of control system you're doing, you're answering who, what, where, when, why, how, and so what? First-grade stuff all over again. And, if you've been following the process, 99% of this is already done.
At PagerDuty, we don't use ITIL, and we expect teams to upgrade software in a transparent fashion. A long time ago, we used Failure Fridays to add strange and new failures to our systems and test how they would respond.
We've written and talked about this idea before, and I'm not going to go over it in depth. But as we've moved to microservices, we've moved from a Failure Friday model to a Failure Any Day model. A Failure Any Day happens, well, any day.
These plans are used as a shorthand to communicate when a team is doing work, and they want that work to be visible across the organization — or they need cross-team support for something. For example, my team used the Failure Any Day model when we were upgrading to DNSSEC a couple of months ago.
We brought along all of the teams that had inbound services at PagerDuty and said you need to have dashboards ready for this so that we know if we broke DNS, how it impacts you. We also try to be cognizant of time zones when we're doing this to ensure there is adequate support in the event of any large-scale disruptions that inadvertently do happen.
To run a Failure Any Day, we need three roles: An Incident Commander, a Scribe, and Subject Matter Expert. An Incident Commander is in charge of running the Failure Any Day. They don't need to be technical. But they are responsible for following your plan, checking that things happen, and gathering consensus on if steps are completed, or validation steps look good. This can easily be your manager, a senior engineer, or someone from outside your team completely. The goal is to have your plan in a spot where anyone can execute it, given the right set of access and SMEs.
A Scribe in a Failure Any Day is responsible for communicating that work outside of your team and intercepting incoming communications so they don't land on the individual SMEs doing the work. They're responsible for informing people about the start and end of a Failure Any Day, and for hitting the record button in Zoom. Oftentimes, they live-scribe the event so that if something does go wrong, we know when things happened for our postmortem process.
Finally, we have Subject Matter Experts. In this case, that's probably you. But if your plan's sufficiently robust, any member of your team should be able to execute on it. There might even be multiple SMEs that you want to rope in.
In Cloud Infra's case, the Incident Commander is often the person who wrote the plan the first time. A senior engineer acts as the Subject Matter Expert and junior engineers the Scribe, but that's not a hard rule. Sometimes, you need to be all three — or the Incident Commander and Scribe — based on how resources work or how other priorities land in your company.
Then, you go and do the actual upgrade. Don't F it up. It's super easy, barely an inconvenience. Then you do it again. An upgrade is likely in a region, and you didn't spend all this time and energy just to do it once. You'll likely have multiple environments, regions, or more that you need to do it in. So, when you're done, think about what went well, think about what didn't go well, reflect, and improve so that when you can do it again, you can do it better.
You'll notice the Legos have changed. In this case, they've rearranged to remove some steps and make things even better. Then you got to speed up and go faster. There are variable definitions of what gets you here. Automation, Infrastructure as Code, etc., can help you with upgrades. They certainly have helped my team.
But I think more important is knowledge sharing and familiarity with systems, because upgrading Nomad is scary for new engineers. Bring your junior engineers along in your upgrades. Last year, we had a co-op who was doing production Nomad upgrades at PagerDuty because we wanted him to get that experience. He's now joined our team as a full-time engineer. Hi Anthony.
Cut down and bring your unnecessary steps under control. For example, during our first round of Nomad migrations, we had a comprehensive checklist that took at least 45 minutes to check in each region to make sure that we were good to go on to cutting client jobs over to new nodes. And after a couple of migrations, it was down to 10 minutes, because we found a bunch of things didn't actually give us any more of an indication of success than other things.
We capture opinionated configuration stances as artifacts in Chef. This isn't rocket surgery, but we have the entire way we configure our stack in Chef. We pin default versions per environment, and we allow overriding them through the Chef inheritance hierarchy so we are able to change software versions.
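As a rough illustration of that kind of pinning, a Chef attributes sketch might look like the following. The attribute names and version numbers here are invented for the example, not PagerDuty's real tree.

```ruby
# attributes/default.rb — fleet-wide defaults (illustrative names/versions)
default['hashistack']['nomad']['version']  = '1.6.2'
default['hashistack']['consul']['version'] = '1.16.1'

# environments/staging.rb — a lower environment overrides the default to
# bake a newer version before it ever rolls toward production
override_attributes(
  'hashistack' => { 'nomad' => { 'version' => '1.6.3' } }
)
```

The point of the hierarchy is that an upgrade starts as a one-line override in a lower environment and only later becomes the fleet-wide default.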
We bake our upgrades over long periods whenever possible. Oftentimes, when we do a Nomad or Consul upgrade, we'll leave it running in a lower environment for a week to shake out any issues before we upgrade the next higher environment. But I know that's not always reasonable. We also have some other tooling that helps. We have Cluster Sentinel. It runs on all Nomad nodes and tests connectivity across our HashiStack. It allows us to be very confident that Nomad is working correctly and is able to schedule jobs, talk to Vault, and do Consul things.
After an incident in December of '21, we also built a similar job called Nomad DNS Checker that checks DNS connectivity across our Nomad fleet, both internally and externally. So we know if anything has happened with Route 53, our internal resolvers — whatever it is — or a botched resolver config. That way we are, again, aware of any disruptions almost as soon as they happen.
We use Backstage and a golden path. Outside of a few legacy services, our microservice environment is built using this golden path. That allows us to build generic services that get deployed the same way. If there's a big change, we're able to update the skeleton and ask service owners to reapply the golden path to get the new things. We also have a lot of CI/CD consistency across most of our services, which gives us controls there.
For example, we check the HCL syntax of every job on every run in every deploy of a service. This also allows us to inject checks for things like the deprecated Nomad port_map. We were able to go and break people's builds and say, you need to update this port_map stanza in your HCL job — here's how to go and do it.
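A CI gate like that can be sketched as a small shell check. This is a simplified, hypothetical version — the real pipeline validates full HCL syntax, while this sketch only greps for the deprecated stanza, and the file name and message are made up.

```shell
# Hypothetical CI gate: fail any build whose Nomad job file still uses the
# deprecated port_map stanza.
check_job() {
  if grep -q 'port_map' "$1"; then
    echo "FAIL: $1 still uses the deprecated port_map stanza"
    return 1
  fi
  echo "OK: $1"
}

# A toy job fragment that would trip the gate.
cat > legacy.nomad <<'EOF'
config {
  port_map {
    http = 8080
  }
}
EOF

check_job legacy.nomad || true
```

Breaking the build with a message pointing at the fix turns a fleet-wide deprecation into a self-service migration for service owners.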
We also have a good idea of our outlier services, i.e., the big boulders that can crop up during a migration, and we know to pay extra attention to them and double-check that things are working correctly. An example here is our legacy monolith. You guys, do I have a good Raft joke? I don't know any good Raft jokes, but if any of you do, that works.
Every piece of software is different, but we're at HashiConf, so let's talk about some Hashi things. Their products run Raft. Raft has a leader. Don't upgrade the leader first. Don't cause unnecessary leader elections. Test what happens when you upgrade a single node.
How long does it take? Does it cause issues? Does it cause elections? Again, none of this stuff is rocket surgery, but it is all things you don't think about until you start causing them. In particular, build your upgrade plans to explicitly check and make sure you haven't triggered leader elections — and that if you do — you aren't about to do it again. Finally, make sure you have good ways to check this information. If it isn't in your documentation, it should be, or it will be when you're done using your artifacts to start your whole process again.
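The upgrade-order check can be sketched against sample data. The file below is shaped like `consul operator raft list-peers` output, but the node names, addresses, and states are invented; in a real upgrade you would capture this from the live cluster before and after each node.

```shell
# Sample peer list in the shape of `consul operator raft list-peers`
# output (hypothetical nodes; columns: name, address, state, voter).
cat > peers.txt <<'EOF'
consul-1 10.0.0.1:8300 leader   true
consul-2 10.0.0.2:8300 follower true
consul-3 10.0.0.3:8300 follower true
EOF

# Sequence the upgrade: followers first, the leader last, so we never
# force an election mid-upgrade.
awk '$3 == "follower" {print $1, "-> upgrade first"}
     $3 == "leader"   {print $1, "-> upgrade last"}' peers.txt
```

Rerunning the same peer listing after each node and diffing the leader column is a cheap way to confirm you didn't trigger an election you weren't expecting.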
We learned with Nomad not to commingle our migrations if we can help it. In our case, we switched VPCs, OS, and Nomad versions, and reapplied about 125 PRs because of the VPC change. That's a lot of risk in a single upgrade. Also, the version we moved to changed Raft and HCL versions. Don't do that!
As I mentioned before, we also use CI/CD checking for HCL violations, and we use techniques like unlimited-time drains on our jobs. So, we can say — 20 nodes, go do your thing, and we'll come back to you later and shut you down, murder you in your sleep. Then, finally, when we do migrations, we use distinctive names that make targeting resources on one set easy and make sure we don't accidentally commingle them with the jobs we're shutting down.
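An unlimited-time drain boils down to Nomad's `-no-deadline` flag. This is a dry-run sketch that only prints the commands it would run — the node IDs are hypothetical, and a real migration would pull them from the matched inventory instead of a hard-coded list.

```shell
# Dry-run sketch: print a no-deadline drain command for each old node.
# -no-deadline lets allocations migrate at their own pace; we come back
# later to terminate the emptied nodes. (Node IDs are made up.)
for node in 4d2ba53b 9f81c3aa 1c77de02; do
  echo nomad node drain -enable -no-deadline "$node"
done
```

Printing commands before running them is also a useful groundwork habit: the Incident Commander can eyeball the exact list before anyone hits enter.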
Consul can do a lot of things. You can put a lot of things in Consul. Know how you're using it and get tests for it. The biggest lesson we learned was shortening how long we did upgrades and reducing risk by going faster so that we didn't have disparate Consul versions in our staging and production environments. We also used a lot of Ansible in our Consul upgrades to make our upgrades faster. Then we had one playbook to upgrade servers that would upgrade all the clients first, stop the leader, then upgrade the leader.
Then, we had two other sets of playbooks. One updated about 95% of our fleet, and the remaining playbook updated the remaining 5% that needed to be done in a much slower fashion. So we were able to complete our Consul upgrades in about five hours an environment.
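To give a feel for the shape of those playbooks, here is a minimal sketch of a serial client upgrade. The host group, variables, package name, and readiness check are all assumptions for illustration, not PagerDuty's actual playbook.

```yaml
# Illustrative Ansible play: upgrade Consul clients one at a time,
# leaving the current leader for a separate, final play.
- hosts: consul_clients
  serial: 1                     # one node at a time to limit blast radius
  tasks:
    - name: Install the target Consul version
      ansible.builtin.package:
        name: "consul-{{ consul_version }}"
        state: present

    - name: Restart the Consul agent
      ansible.builtin.service:
        name: consul
        state: restarted

    - name: Wait for the agent to report itself alive again
      ansible.builtin.command: consul members
      register: members
      until: "'alive' in members.stdout"
      retries: 10
      delay: 6
      changed_when: false
```

The `serial: 1` plus an explicit readiness check is what makes the playbook safe to leave running across a large fleet while humans only watch the slow 5%.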
Vault's complicated. In our case, we weren't so much upgrading Vault as we were taking full ownership of it after it had been upgraded. Most of our lessons were around how to get it off of a Zookeeper backend and figuring out our final topologies. Our most important lesson here was de-risking our upgrades while delivering value. And we took multiple smaller downtimes for each step rather than one big downtime that may prevent teams from deploying.
Remember three things: Planning, groundwork, and execution. Define what success looks like for your team early. Build your plans with Legos because Legos are super fun. Upgrades aren't something you do alone. Learn and adapt so you can go faster.
Finally, thank you so much, and enjoy the rest of the event.