Sometimes derided by critics as being overly-complicated, systemd is actually really good if you focus on using "the good parts," says James Nugent.
Even though there is some dissatisfaction with systemd in the sysadmin community, James Nugent, an engineer at Samsung Research, shows why it largely fulfills its primary purpose.
In this talk he takes a tour of the great, and sometimes unappreciated or lesser-known features of systemd. He also explains how it can be paired with Packer and Terraform as a component of self-healing systems, and how it can also be used in a production-grade Consul cluster.
Hi everyone. My name's James Nugent. I work for Samsung Research and unfortunately my GIF game is not quite as good as Abby's. So, I didn't have time to go and like put new animated animals and stuff into my slides. So, sorry about that.
This morning we're going to be talking about systemd, which is an init system for Linux. When I made the submission, I expected that this was going to be a two track conference, and I'd be in the small room with like 15 people that cared about this. So, after Abby's rousing start, I'm going to put people back to sleep by talking about obscure bits of the operating system.
Just so I have an idea of the audience here, can I get a show of hands for people that run Linux systems on a day to day basis? Wow, that's—so this is probably easier to do it this way round. Does anybody use a Linux which does not have systemd? Cool, okay. Is it all Red Hat? Right.
I think it's fair to say that systemd is one of the most controversial software projects around. This is reflected in the coverage throughout the tech press. This came from The Register, the UK IT news site, the other day. This is one of the more restrained op-eds.
They range from sanguine like this, to more hysterical, like "Is systemd the end of Linux?" And clearly, from the show of hands, the answer is no. Some of the coverage is pretty reasoned in evaluating systemd against its own stated design goals, such as this post from the EWONTFIX blog. And some people have just table flipped on systemd and forked all the distributions, and taken it back out again. So it's also been a driver for people moving to operating systems which are not Linux. So you get a lot of people switching over to BSDs and that kind of thing, just to avoid this thing.
I'm not interested in relitigating the arguments for and against systemd at this point; it's far too early to be drinking. So instead, what I'd like to do in this talk is focus on this: assuming you're committed to using a Linux distribution that uses systemd as its init system, what can it do for you, and how can you find more information about how to use it correctly?
So the chances are, clearly from the show of hands, that you're using a distribution that has systemd support. The project was started in 2010 by a couple of guys from Red Hat, and it's spread throughout basically every mainstream Linux distribution. So we're now at the point where every long term support major distribution has systemd in it, so that's Red Hat, Ubuntu, and Debian based, and SUSE. That's another one I have no idea how to say. Maybe some Germans can tell me how you say SUSE later.
Each of these has the same init system now. That's uncommon. There were major differences between various distributions before that, and it was kind of impossible to write something that would work reliably and package reliably across lots of different distributions. So the genesis of this talk is the questions and comments I got after my talk at last year's HashiDays, which made me realize that a lot of people were interested in using systemd effectively.
In that talk, I went through bootstrapping the HashiCorp stacks of Consul, Nomad, Vault, on AWS, using some of the patterns that we'd developed for bootstrapping HashiCorp on SaaS. And all of the questions that I got were, "Huh, I didn't realize," they weren't really questions but it was, "Huh, I didn't realize that somebody could do that."
The code for that talk's still available, I'll put a link up to it afterwards, and fortunately other people have been maintaining it for me, so periodically I get a pull request, and it's like, "Hey, this is broken. You should fix it." That's where we're coming from, so let's go and talk about different bits of systemd.
One of the most cited complaints about systemd is that it doesn't obey the Unix philosophy, the principles put forth by the lesser known Unix inventor, Doug McIlroy. Programs should do one thing, do it well, be composable via text streams, all that kind of stuff, and that criticism's kind of fair. Systemd encompasses a whole suite of software and has a lot of moving parts. It's more than just a single process in the init system, and there are three primary aspects to it.
The first is the system and service manager, and that's what I'm going to focus on. That's the thing that runs as PID 1, it replaces run levels, it controls the machine boot, and it replaces all of the shell scripts that traditionally form that part of the system boot.
The second thing it does is act as a platform for other people to build on top of.
And the third thing it does is provide this thing called dbus, which is like glue between the kernel and user space, that can be used for all kinds of black magic.
The most important thing to understand is that systemd manages units—or it manages resources represented in the system as units—and it manages them in a dependency graph. So units can declare that they require or have an optional dependence on some other units in the system, and the startup sequence will take care of ensuring that if I have a dependency on this thing, it will make sure that thing I have a dependency on has started before I try to start.
We could declare, for example, that the Consul agent requires the network to be up, and there's no point in trying to start the Consul agent until the network's up, because it's not going to do anything. So, what you can do is optimize boot into this directed graph, effectively. Now, in and of itself, that's not an innovation. So both launchd on macOS, from the mid 2000s, and SMF on Solaris and Illumos, also from the mid 2000s, implement the same model. And closer to home, Canonical's upstart init system does the same thing on Linux. That was used for a long time at Ubuntu.
But one of the major differentiators is the units under management in systemd don't just have to be services, and not everything needs to be started at boot.
The way these things get configured is you write a text file that describes the unit that you want to create, and you stick it in a different place in the file system, depending on who you are. So if you're building packages, or you're an OS maintainer, you stick it in /lib/systemd/system. You can override these by putting unit files in /etc/systemd/system, or for nonpersistent modifications for a single thing, you can put it in the tmpfs file system in /run.
It's important to understand the context of these things, so most files in /etc are marked as conf files in the package managers for the OSes. So if you upgrade a package, things in /etc won't generally get changed, and things in /lib will get replaced. So if you make local modifications in the wrong place, they're liable to disappear.
You get files named with this kind of pattern, where they have a name and then a type. So, we might have, for example, a unit called consul.service to run the Consul agent. We might have another one called ssh.socket, which is a socket unit for accepting connections for an SSH server. We'll talk more about that in a bit. The most common thing to find is service units, so let's look at the contents of one of the unit files that would run, say, a Consul agent. (Oh wow, that's kind of big, huh?)
The files are this format that evokes the Windows 3.1 ini-style system, or toml, although as far as I'm aware it's actually neither of those formats, and something completely custom, because... why wouldn't it be? And they're just key-value pairs, which configure the various options for a particular service.
So there are three important sections, two which are in every type of unit, and one which is specific to services. Looking at the two that are common across everything, there's a unit section, which defines things like: what's the thing called, and what are its dependencies? So Consul isn't much use without the network, so we say that we're going to require the network to be online before we're going to start, and we want to insert ourselves in the dependency tree so that after the network's started, that's when we'll try and start Consul.
The install section, which normally appears at the end of these files, is the thing that actually places it into the dependency tree, and the way that's done is by hooking into another unit which is part of the boot process already. Traditionally the way that was done was through run levels. Systemd replaces those, so you can hook into multi-user.target, and that's basically the equivalent of saying, "This thing runs at run level 3," for people used to older init systems.
In the middle of the file there's this section which is specific to the service, and we specify things like which binary are we going to run, so /usr/local/bin/consul with the parameters, and who are we going to run it as. If we try and reload the config, say we change our TLS certificates or something, then what command are we going to send the running binary? In this case, SIGHUP, and how do we stop it?
Then finally there's a restart policy that says, "Let's assume this thing exits and it's not supposed to, what are we going to do?" So in this case, we're going to just restart it, and that's the universal answer to these, I'm surprised it isn't the default actually.
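To make the three sections concrete, here is a rough sketch of what a unit file like the one on the slide might look like. It is not the file from the talk: the `consul` user, the `-config-dir` path, the `KillSignal` choice and the exact `Restart=` value are all assumptions for illustration.

```ini
# /etc/systemd/system/consul.service (illustrative sketch, not the talk's file)
[Unit]
Description=Consul agent
# Require the network, and order ourselves after it in the dependency graph.
Requires=network-online.target
After=network-online.target

[Service]
# Assumed user and config directory; adjust for your own packaging.
User=consul
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d
# On reload, send SIGHUP to the main process.
ExecReload=/bin/kill -HUP $MAINPID
# Assumed stop signal.
KillSignal=SIGINT
# Restart if the process exits when it is not supposed to.
Restart=on-failure

[Install]
# Roughly the equivalent of "runs at run level 3".
WantedBy=multi-user.target
```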
There are a few different types of these services that you can run, they all behave in slightly different ways. The simple service type is the default, so you don't have to specify that, but that's for normal binaries that behave themselves properly, and they just run and they stay attached to the terminal that runs them. They just keep running until they stop.
The second type, forking, is for programs that use the older style of daemonizing themselves. We'll get into more ctls and adms in a second. I don't know the pronunciation of those ones, either. Forking says this executable is going to exit as soon as it's started, because it's going to double fork, trying to attach itself to PID 1. So systemd can keep track of those, and keep them attached to the correct services, and handle them correctly.
There's another type of service which doesn't keep running, but is useful to be in the dependency tree. So, these are things where you just want to run one command, and then once it's run successfully, everything else can carry on. But the thing doesn't actually stay resident and keep running. So they're called one shot services. Very useful for bootstrapping scripts and that kind of thing.
Finally, there is this type of service called a notify service. Quite often, especially if you're in a distributed system or something, it's not enough to just be running the binary to have this thing ready. In the case of Consul, the Consul binary can be running, but unless it's a member of a cluster, it's not that much use. And there's no point in trying to start anything else that's downstream of it.
So there's a pattern that we can look at later, which Consul supports natively, but you can also support in anything else, where a service can tell the init system when it's ready for actual work as opposed to just when it's running its binary.
This is where we get into the bad pronunciation. These systemd things are run using the command I'm going to call systemctl. I've heard people use system control, or system-CTL or whatever; I'm going to use systemctl. So, of these commands, these are the commonly used ones; there are a bunch of others as well. So units are installed, but not enabled, which means they won't start at boot. And for servers that's not that useful, but for the desktop it's very useful. You don't want all this stuff starting every time you start your laptop. You reserve your panic for when you plug in a mouse, or not, but you don't want it to happen at boot.
So, enable will mark that it should be started at boot. And then start, stop, and restart are fairly self-explanatory. They send the commands or run the commands that are specified in the unit file. And status, you'd think would be obvious, but it's actually really useful. If we take a look at the output of status for that Consul unit, we've got like a ton of information here.
Some of this stuff we've got is the unit name and the description, as it came out of the unit file. And we've got the location that the unit file was loaded from. Which is incredibly useful when it has one name, but it's actually something completely different. The other thing we have is the enabled state. So in this case it's disabled, because I never ran systemctl enable. So that means it won't run at boot. If we rebooted the box, we wouldn't have Consul running. We have the uptime, so clearly this screenshot was taken for this demo, because it's been running for seven seconds; it's not an actual production service.
The next thing we have is kind of huge for a Linux init system, and it's traditionally not been very possible. We have all of the processes that this service is responsible for. So the way it does that is systemd launches all of the processes related to a particular service into a cgroup that only has that service's processes in it. So, in this case, in the system slice, which is a unit of accounting, it's created a cgroup called consul.service. And in this case, we only have one process, because Consul behaves properly, but if we had something that did forking, or if we were running like Apache, where it has a main worker and then a bunch of additional workers, and then a bunch of CGI scripts that are being exec'd, they'll all run in the context of the same cgroup.
So it's finally possible to actually stop a service. Previously, you could stop the main binary if you knew where it was, or you go chasing through some PID file, but it was never actually possible to terminate everything associated with the service in one easy command. And it now is.
Finally, we got like the last 10 log lines. These are just stdout by default. You can kind of configure what gets logged. One of the more controversial parts of systemd, is that it has a binary logging system called the journal. It kind of works, I don't really get why people are that bothered about it, but it works.
So we've looked at how units get configured by lonely package maintainers, but quite often what you want to do is just override one or two settings. You don't want to go and rewrite the whole unit file, because you want to run a bootstrapping script, or because you want to change an option.
So, there is a system we call drop-in configuration. What that does is allow us to override the defaults supplied in the system by putting them in a new unit file. They live in this part of the tree here [15:08], at /etc or /lib, depending on where you're putting it, or /run, depending on the semantics you want. Then there's a per-service configuration directory where you can put as many files as you like. It's commonplace to name them according to numbers. So you put 10-config, or whatever, according to the priority, and they get loaded alphabetically. That is alphabetically, so they're lexically sorted, so be careful with ten and a hundred in file names, anyway.
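The ten-versus-a-hundred trap is easy to see with a plain string sort. The following Go sketch only illustrates lexical ordering; the drop-in file names are made up, and it does not reproduce how systemd merges directories.

```go
package main

import (
	"fmt"
	"sort"
)

// loadOrder returns the names in plain lexical (byte-wise) order,
// which is the ordering described for drop-in files.
func loadOrder(names []string) []string {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	return sorted
}

func main() {
	// Hypothetical drop-in names for a consul.service.d directory.
	names := []string{"20-env.conf", "100-tls.conf", "10-bootstrap.conf"}
	for _, name := range loadOrder(names) {
		fmt.Println(name)
	}
}
```

Here `100-tls.conf` sorts before `20-env.conf`, which is probably not what the author of those files meant. Zero-padding the prefixes (`010-`, `020-`, `100-`) is one way to make lexical and numeric order agree.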
There's a problem whenever you end up with a system where you can put overrides everywhere, which is: what's actually the effective config that's running? So if you load a service that has a drop-in config, the output of status changes and you now see not only the unit which caused the thing to exist, but you also see all of the drop-in units. So you can look at any particular running service and know where this thing got its config from, and in what order. Which is kind of useful.
That works for the individual service, and if you're interested in doing that for a whole box, there's a command that comes with systemd called systemd-delta. And that will tell you the effective running config of every service on the machine and every config file which has been loaded. Kind of useful.
There's a common bootstrapping pattern using drop-in units. So quite often you'll find some software that's either you've packaged yourself, or is included in a base system repository. These things have to work in a lot of different places. So people tend to make a very generic configuration. For example, that Consul package, we just get the binary and maybe some default configuration that turns on a couple of common options.
But for environment-specific things, we want those to be overridden by administrators. So what we can do is build an environment-specific package that just has a drop-in unit in it, and then install that, reload the systemd configuration, or reboot the box, and the environment-specific config will override the generic config that comes from the base package.
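As a sketch of what such an environment-specific package might ship, here is a single drop-in file. The file name, the second config directory and the base `ExecStart` line are assumptions, not the contents of the repository mentioned in the talk.

```ini
# /etc/systemd/system/consul.service.d/10-environment.conf (hypothetical name)
[Service]
# An empty ExecStart= clears the command inherited from the base unit,
# so the next line replaces it instead of adding a second command.
ExecStart=
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d -config-dir=/etc/consul-env.d
```

After installing a file like this, `systemctl daemon-reload` makes systemd pick up the change, and `systemctl status consul` should then list the drop-in alongside the base unit.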
This can be used for separating out concerns when you're doing things like bootstrapping across different clouds. Bootstrapping scripts are often heavily dependent on a particular cloud's API, so you just build multiple packages. It's kind of important to install these things via packages and not by user data, because you can cryptographically verify that the config hasn't changed. At least apt can do this; I assume rpm can do this, but I'm actually not sure. With apt, you can verify cryptographically that the contents of the machine are as they were installed. And that's kind of useful, and actually required in some highly regulated environments. This is a working example of this pattern for Consul, in that repository that I was talking about; I'll stick the link up at the end.
So we talked about service units, but there are lots of different types of units. I'm going to cover a couple more of them. The first one is target units. Targets are largely a replacement for run levels in older init systems. So traditionally, you'd boot through a number of different phases in a machine, it would run in single user mode or rescue mode, and then it would boot to multi-user, and then if you're on a desktop, it might boot to a graphical run level. I think these were like one, three, and five, or something like that, I don't actually know the numbers.
Systemd replaces these with named targets and then it symlinks the old ones. So there's a runlevel3.target, which is a symlink to multi-user.target. But an important difference is that we can define our own run levels. This can be really useful for orchestrating system start up.
Let's imagine we have this script. This is not the correct way to do this anymore, but this used to be required. Imagine we have this bash script [18:54], because all problems are solved with bash, that pings the local Consul agent to see if it's part of a cluster or not. And if it is, the script will just exit. So we can use a one shot service for this, and it can start after Consul. So after this thing is started, we're going to start pinging it until it's joined the cluster. And then, what we want, is for all of the downstream services, which are dependent on Consul, to not only wait for it to start, but also for it to be a functioning member of a cluster before we start downstream services.
The correct way to do this now is different, but this was the old way. So we can put this service, this one shot service, into its own target by writing a new target file called something like consul-online. And then all of the downstream services can use the Requires directive to say that this service requires Consul to not only be started, but also to be online. Targets get reached when all of the services within them are started successfully. So that's how multi-user works, and you can define your own things in the same manner.
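The old pattern can be sketched as three pieces, shown together here although each comment header marks a separate file. The unit names and the script path are hypothetical, and this is a reconstruction of the idea rather than the files from the talk.

```ini
# consul-online.target (hypothetical name): reached once the check below succeeds
[Unit]
Description=Consul agent has joined a cluster
Requires=consul-online.service
After=consul-online.service

# consul-online.service: one-shot check that exits once the agent is in a cluster
[Unit]
Requires=consul.service
After=consul.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Hypothetical path to the bash script that polls the local agent.
ExecStart=/usr/local/bin/consul-online.sh

# Lines for any downstream unit that must wait for a working agent
[Unit]
Requires=consul-online.target
After=consul-online.target
```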
The correct way to do this now is to use this pull request that got merged into Consul. It was literally like a week after I did the original talk as well. This has been in production for a while now. So what Consul will do now is notify systemd when it's a member of the cluster, according to the docs. I haven't actually tried it. I believe it works.
The way this actually works in practice is when the service gets started, systemd sets an environment variable on it which tells you where a domain socket is. And when your service is finished bootstrapping itself and is in a state ready to do actual work, you can write the string READY=1 to that socket. And it's obviously case-sensitive, because everything in systemd is case-sensitive, because they hate everybody. I actually spent ages looking for this problem, where it just had the wrong case on the string.
Consul will do this now, and as soon as it's a member of a cluster, it will notify the init system. So you can get rid of that whole target, and just make everything depend on Consul. That's now the correct way to do it. But it's still useful for things that don't support this yet. And there's actually a library that's part of Consul if you're writing Go services, that you can ... it's like three or four lines, so you don't really need to put in a library, but if you like libraries, then there's one that's part of Consul that will do this for you.
One of the use cases for this is services which need to prime caches and things like that. It's just not enough to know that the binary's running. You have to know it's ready for work. And you can use health checks for that. But there's no point in pinging something for a health check if it's not actually running. If it can't possibly work yet, then why bother checking it? Just let it tell you when it's done.
Another useful thing is socket activation. This is especially useful. Does anybody here write services in Go? Cool. Who likes trying to drop privileges in Go? Cool. You can't really, because of the threading model. This is useful if you're after the principle of least privilege.
What socket units allow you to do is have systemd listen on a network interface on your behalf. And when there's traffic, it will start your service and pass a copy of the file descriptor that represents that client down to the service. Lennart Poettering, the creator of systemd, whose name I just butchered, posted about this on his blog. And there are some important bits in this pretty long quote [22:35].
The first thing is, if a service dies, its listening socket doesn't die with it. So you don't lose client connections. This was a big deal for HAProxy for a long time. They finally fixed it via some ridiculous amount of code. But that's because they have to be cross-platform. But on Linux, you can just do this socket unit and then you don't have to care about it.
The other useful thing is, if a service is being upgraded, you don't have to disconnect all the clients, because systemd is holding their connection. You just go and recover the connections afterwards. So if you want to upgrade a long-running service, you can do that.
This is really useful if you want to, say, bind to a privileged port. You don't really want to run your services as root. But if you want to bind to a privileged port and you don't want to screw around with setting capabilities and things, then your only real option used to be to run the thing as root and then drop privileges. But the Go threading model doesn't allow for that, or at least it doesn't allow easily for that. So, this is a good alternative.
How do you actually do this? Here's a program [23:34]. I believe this is the production system used for the UK Inland Revenue's customer service line. And it's an HTTP server that returns 404 for every single request. You can tell what I spent my day doing. We can modify this to use socket activation instead by replacing a couple of little things.
All we're doing here is starting a TCP socket on port 8081, and then serving HTTP over it. To modify this to use socket activation instead, we can pull in the CoreOS systemd library, and rather than creating our own listener, we can use this function that will go and get the listeners that systemd has started for you. If you don't have one, that's a problem. Then you can just call http.Serve in the same way as normal. The difference here is we can now have multiple sockets, because you can be listening in lots of different places for the same service.
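The slide used the CoreOS library for this. As a dependency-free illustration of what happens underneath, here is a sketch of the same 404 server using only the Go standard library. It assumes systemd's convention of passing sockets starting at file descriptor 3 and describing them through the `LISTEN_PID` and `LISTEN_FDS` environment variables; the helper names are made up.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"os"
	"strconv"
)

// Passed sockets start at descriptor 3, after stdin, stdout and stderr.
const listenFDsStart = 3

// inheritedFDCount returns how many sockets were passed to the process with
// the given pid, or 0 if the variables are missing or meant for another process.
func inheritedFDCount(listenPID, listenFDs string, pid int) int {
	p, err := strconv.Atoi(listenPID)
	if err != nil || p != pid {
		return 0
	}
	n, err := strconv.Atoi(listenFDs)
	if err != nil || n < 0 {
		return 0
	}
	return n
}

// systemdListeners wraps every inherited descriptor in a net.Listener.
func systemdListeners() ([]net.Listener, error) {
	n := inheritedFDCount(os.Getenv("LISTEN_PID"), os.Getenv("LISTEN_FDS"), os.Getpid())
	listeners := make([]net.Listener, 0, n)
	for fd := listenFDsStart; fd < listenFDsStart+n; fd++ {
		f := os.NewFile(uintptr(fd), "systemd-socket-"+strconv.Itoa(fd))
		l, err := net.FileListener(f)
		f.Close() // FileListener works on a duplicate, so the original can be closed.
		if err != nil {
			return nil, err
		}
		listeners = append(listeners, l)
	}
	return listeners, nil
}

func main() {
	listeners, err := systemdListeners()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot use inherited sockets:", err)
		return
	}
	if len(listeners) == 0 {
		fmt.Fprintln(os.Stderr, "no sockets were passed in; start this via a socket unit")
		return
	}
	// Serve 404 for every request on the first inherited socket.
	fmt.Fprintln(os.Stderr, http.Serve(listeners[0], http.NotFoundHandler()))
}
```

With several sockets configured you would start one `http.Serve` per listener, for example each in its own goroutine.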
If we went and built this new service, we'd run go build 404.go. And we can go test it using this systemd-socket-activate, which is effectively what gets used underneath. You don't actually have to deal with this on a regular basis. But if you want to test something outside of the init system, then you can use this to simulate that. You can tell it to listen on port 8000, and then run your 404 binary when it gets traffic. You get the output saying it's listening and then, in a separate terminal, we can call the thing and it will return 404 every time.
But at this point, systemd is listening and all you're doing is accepting connections. The actual listener doesn't need to die and it doesn't need ... You could bind to port 80 here, run the socket as root, and then the service as something that isn't root.
To make that actually work outside of the command line, you need a service unit. This is the simplest possible thing for a Go binary that behaves itself. And you need a socket unit. What the socket unit does is just say, "Which port are you gonna listen on? Are you interested in IPv4 versus IPv6?" That's actually a default. You don't have to specify that. So it's not quite the simplest possible thing. But unfortunately, systemd hasn't quite succeeded in unifying all the pointless differences between Linuxes because people get the option of still setting these stupid defaults. So some have it turned on. Some have it turned off. So it's better to always be explicit about these things if you're interested in running cross-platform.
And then finally, we install it into the sockets target which is something that runs after the network is up and available. And you'll notice that we didn't link the socket and the service together in any meaningful way. So that doesn't have a reference to the socket, and that doesn't have a reference to the service. They find each other by a common prefix. If you happen to name them different things, it is possible to link them via a different directive in the file.
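A pair of units along these lines might look like the sketch below, with each comment header marking a separate file. The unit names, binary path and user are assumptions; the two files find each other only through the shared `404` prefix.

```ini
# 404.socket
[Socket]
# Port to listen on, matching the original program.
ListenStream=8081
# Be explicit about IPv4 and IPv6 rather than relying on the distribution default.
BindIPv6Only=both

[Install]
WantedBy=sockets.target

# 404.service
[Service]
# Hypothetical install path and unprivileged user.
ExecStart=/usr/local/bin/404
User=nobody
```

If the two units had different names, the socket unit's `Service=` setting is the directive that ties them together.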
One thing that I didn't cover there is graceful shutdown. Right now if you kill the service to restart it or something you will kill any active connections. There's a great blog post. I'll put the link up at the end. That covers how you actually do graceful shutdown and zero downtime upgrades using this library and this pattern. Worth a read if you're doing Go services.
The last one that I want to cover—there are a bunch of different units I'm just not gonna talk about—but the last one that's interesting is a replacement for cron, which is called timer units. Timer units. Does anybody like the cron syntax? One person. Wow. Okay. Basically you don't have to do that anymore. You can write the service file at the top. This thing just dumps the date onto the terminal, I think. Is that what it does? Yeah. And there's a timer unit file underneath. Again, it's linked by the service name to the timer name. You can specify them if you need to call them different things, though. And this timer thing has lots of different options in the unit configuration.
One of them is OnCalendar, and this says it's gonna run every 10 minutes. There are also things like, say, "run this two minutes after startup of the box," or something like that, or, "run this on every hour, provided day ends in ..." Well, they all end in Y. Nevermind. Bad analogy.
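For reference, a service and timer pair like the one described could be sketched as follows. The unit names and the `/bin/date` path are assumptions rather than the slide's contents, and each comment header marks a separate file.

```ini
# date.service
[Service]
Type=oneshot
ExecStart=/bin/date

# date.timer: linked to date.service by the shared name
[Timer]
# Every 10 minutes, on the minute.
OnCalendar=*:0/10
# Additionally, two minutes after the box boots.
OnBootSec=2min

[Install]
WantedBy=timers.target
```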
In general, the documentation for these is pretty good. Once you know the things exist, it's easy to find out what the various options are. But actually understanding the scope of the entire system is quite hard. This talk only really scratches the surface of the capabilities of the system as a whole.
We haven't talked about the logging subsystem, except to say we didn't care about it. We haven't talked about any of the desktop-oriented features. You have the ability to start services when particular devices are connected. That's how you get Linux to come up when you plug a projector in, in more efficient manners. We haven't talked about how to manage temporary files using tmpfiles. We haven't talked about how to do network awareness, "Run these services when you're connected to this type of network." Very useful for connection to public wifi, that kind of thing, if you're on a desktop. And we also haven't talked about logind, which is the user session manager.
The other things I haven't talked about are resource accounting. Because we're running everything in cgroups, you effectively get all of the resource controls available in Linux available to you via systemd. And there's a program called nspawn which will effectively allow you to launch binaries which are not packaged into any kind of container format that will run with all of the isolation of, say, a Docker container. Again, there's a bunch of libraries from, mostly CoreOS that allow you to natively integrate your own stuff with all of that, in particular if you're in Go. There's lots of other people's libraries for other platforms, too.
If you're interested in more information about that, then there's a link at the end to the documentation that will tell you about all those different types of units that we haven't had time to talk about. With that, I will not keep people from coffee any longer. Thanks for listening, and there's a list of all these references here.