Presentation

Cloud-Init: The Good Parts

Cloud-Init is the de facto industry standard for early-stage initialization of virtual machines in the cloud, but few engineers are familiar with everything that it has to offer.

All Linux virtual machines have Cloud-Init in their boot phase, whether they're as small as a t3.nano instance in AWS, as large as a Standard_HB60rs in Azure, or an on-premises OpenStack instance. Originally designed for Ubuntu on EC2, Cloud-Init provides first-boot configuration support across most Linux distributions and all major clouds.

Many operators are familiar with supplying a shell script via user-data when provisioning their compute resources, but Cloud-Init has a massive amount of other functionality that is, more often than not, left untapped.

In this talk, Event Store co-founder James Nugent explores that untapped potential by looking through some of the features of Cloud-Init and showing how to take advantage of them to improve the operability and resilience of your cloud operations.

Transcript

Hi, everyone. Sorry to anybody who came expecting a talk about Vault.

Just so I have an idea of who the audience is here, because I know this conference attracts a lot of different backgrounds, who here runs cloud software on a day-to-day basis? Cool. And who’s familiar with Cloud-Init? Who would call themselves an expert on Cloud-Init? OK, cool. That’s a great audience for this thing.

The genesis of this talk, and the reason that I changed it at the last minute, was that last year I did a talk at HashiConf called "systemd: The Good Parts". I got a bunch of good feedback about that, and I was thinking, "What other tools are there where everybody has passing familiarity, but no one really digs deep into them or necessarily takes the time to go and explore all of the options?"

So here we go. This is "Cloud-Init: The Good Parts," and it's a surprisingly long talk given the title.

What is Cloud-Init?

It’s pretty much the de facto industry standard for early-stage initialization of virtual machines in the cloud. By early stage, I mean on first boot or on subsequent boots of a particular virtual machine. It’s used to specialize a generic operating system image that might be provided by one of the cloud vendors or one of the OS vendors, and it’s used to specialize at runtime to do whatever job the virtual machine is supposed to be doing in your infrastructure.

It was originally developed by Canonical when EC2 was first announced. Back then, Ubuntu was the only thing that ran on EC2, so that’s all they had to work on. But it’s now prevalent across every major cloud and every operating system pretty much, so you can find it pretty much everywhere. All of the Linux distribution images ship with it by default. FreeBSD, SmartOS. Everything ships with it by default these days. Furthermore, not only operating systems, but also across every cloud. Almost every cloud vendor has really good support for Cloud-Init built into the provisioning plane.

Why do we want to runtime-specialize machine images?

If you've heard me talk at a previous HashiConf, I've often talked about image-based workflows and package management as being the solution for this, but that can be costly in terms of time. If you're trying to get software out quickly, then taking the time to boot an instance and build an image from it can take too long, anywhere from 5 minutes to an hour depending on the underlying platform.

If you’re doing rapidly evolving software, if you’re trying to release multiple times an hour or something like that, then that just might be too much of a cost to pay. It might not be, but there are tradeoffs to be made. But both approaches definitely have their place, and it’s worth knowing about both. A correctly constrained runtime configuration can provide many of the same benefits as an image-based workflow, just a bit quicker.

And we can still use Cloud-Init in the image-based build process, so we’ll take a look at that shortly.

Let’s look at this by example—and I’m going to go through a whole bunch of examples. One of them is just running scripts, the simplest thing that can possibly work. Then we’re going to look at changing default user configuration, installing packages, suppressing some of the default behavior (which might be undesirable), writing arbitrary files, configuring SSH keys, and then we can take a look at how you can find out more about what this thing can do, because it’s not discoverable.

Configuring Cloud-Init

To configure Cloud-Init: it comes installed on all of those images that I listed earlier. On systemd-based Linux, it's usually a service that runs on boot before most other things. When it starts with the init subcommand—there are some others as well—it runs a sequence of modules that specialize the machine in different ways.

And configuration comes from 2 places. The cloud’s provisioning plane supplies a bunch of metadata about the particular machine: its network configuration, its disks, all of that kind of stuff.
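On EC2, for example, that metadata comes from the instance metadata service, and you can poke at the same data Cloud-Init consumes from inside the instance:

    # Instance metadata (network, disks, instance type, and so on)
    curl http://169.254.169.254/latest/meta-data/

    # The user data supplied at provisioning time
    curl http://169.254.169.254/latest/user-data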

Then there's user-supplied configuration, which is what we're going to be talking about mostly today in our examples. Generally, whenever you provision a virtual machine—obviously everybody does that with Terraform and not using the portal—but if you are using the portal, there's usually a field called user_data. Azure calls it Cloud-Init directly, but every cloud has some way of specifying a bunch of data that's ultimately going to form the configuration for Cloud-Init.

Example: A small script in user data

Let’s look at the first example. The simplest thing that can possibly work for doing anything with Cloud-Init other than what it does by default. We have a shell script, and all the shell script does is echo some text, including the date stamp, into a file when it’s run. If we run an instance in EC2, we can specify at the bottom this user_data, and then use a file reference to the file on disk.

The API expects that to be Base64-encoded data, and we’ll see why shortly. So not just plaintext. But the CLI takes care of that for you if you reference it in this way. That’s true of most cloud vendors’ CLIs.
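As a rough sketch of the whole flow (the AMI ID, key name, file name, and marker path below are placeholders, not values from the talk):

    #!/usr/bin/env bash
    # bootstrap.sh -- runs once, as root, on first boot
    echo "Provisioned by cloud-init at $(date)" > /var/lib/first-boot-marker

    # The CLI reads the referenced file and Base64-encodes it for the API on our behalf.
    aws ec2 run-instances \
      --image-id ami-0123456789abcdef0 \
      --instance-type t3.nano \
      --key-name my-keypair \
      --user-data file://bootstrap.sh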

Once we’ve run our instance, we SSH into it. We get this annoying message about the host keys not being trusted, but we accept that on blind faith and carry on, as everybody does. We get the message of the day out of it. Then if we cat the location that we expect, we can see that the scripts have run.

We can see it ran as root, so that's interesting. Our shell scripts run as root when we start up. If we reboot, wait for the machine to come back up, and reconnect, then we can cat the file again. But what's interesting is it hasn't run again. This is a one-time thing that runs the first time you boot and not again. There are other modes that will allow scripts to run again, but for the default, if you just provide a shell script, it's going to run once when you boot. So it's great for things like installing packages.

You don’t need this in Amazon, but in other clouds you might need to configure machines to act as routers or specific network roles. These scripts are great for configuring that kind of thing. There’s no need to get any more complex than that.

Some reminders on Bash

Writing correct Bash is hard. Almost no one, myself included, writes Bash that actually handles all of the error conditions properly. In lieu of doing that, put that line at the start of every script, and it will catch most of the problems and prevent things from just blindly continuing in the face of failure.
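The slide itself isn't reproduced in this transcript, but the line being referred to is almost certainly Bash's unofficial strict mode, along these lines:

    #!/usr/bin/env bash
    # Abort on any command failure, treat unset variables as errors,
    # and make a pipeline fail if any command within it fails.
    set -euo pipefail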

shellcheck(1)—always use that. It’s a utility that will lint shell scripts and pick up almost every type of common misunderstanding in the syntax. There’s no reason not to use it on everything.

Some other reminders on Bash: Not every OS has Bash, or the latest version of it. macOS is a particular offender here; it ships Bash 3.x. If you write your scripts on a Mac, test them with the default Bash, and then run them on Linux, you're not even running the same major version of Bash, and Apple are not going to upgrade it. They're going to ship Z shell instead because of licensing concerns.

If you're on FreeBSD, for example, the root shell is different from the user shell, and neither of them is Bash, but Bash is available.

Doing this isn't necessarily the most portable thing you could do, but the reason that our script ran under Bash was the shebang at the top that told Cloud-Init to run it under Bash. We can provide other things as well. We could provide a Python script in there.

Calling Python portable is a bit of a stretch, I guess, given that you have to decide whether you’re on Python 2.x or 3.x and then how it’s installed, on whatever operating system you’re on. But for Ubuntu, that works if you reference Python 3.x instead.

If we do that, we can boot an instance in the same way and then SSH into it, and now we get the "Written with Python" message with that crazy date format. We had to explicitly specify Python 3.x because you have to name the interpreter.
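The script from the slide isn't in the transcript, but the idea is simply a user-data payload whose shebang points at Python 3 instead of Bash; something like this, with a made-up output path:

    #!/usr/bin/env python3
    from datetime import datetime

    # Cloud-Init hands the whole payload to whatever interpreter the shebang names.
    with open("/var/tmp/python-user-data", "w") as handle:
        handle.write(f"Written with Python at {datetime.now()}\n")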

We’re still not doing anything that’s particularly portable.

Not every image is going to have Python installed anyway. Even if it does have Python installed, you have no real way of knowing whether Python 3.x is just going to be the Python binary or some other crazy combination, and you have no way of installing packages either. So this doesn’t really solve the problem of Bash being nonportable.

There’s a whole bunch of things that you commonly want to do, and Cloud-Init has these modules already built in and already configured.

Configuring with #cloud-config

What we can do is configure them using a format called #cloud-config, and then we don’t have to write brittle Bash scripts to do all this configuration. Hopefully, Cloud-Init does the correct thing and handles all the error cases. And in most cases, it appears to.

Here on screen is an example of #cloud-config. It has a shebang-like header at the top that identifies it as being a #cloud-config file. I don't think that's strictly necessary, but it's good practice.

In this particular example, what we’re going to do is override some of the default configuration. Cloud-Init does some work regardless of whether you configure it or supply any user data at all.

Whenever you start an EC2 instance—and this is also true of most other clouds—the image doesn't have any users created other than root. You probably know that if you start a different image, you get a different username—so "ec2-user" for all the CentOS derivatives, "ubuntu" for Ubuntu. I don't know what Debian's is offhand, but they're all different. And Cloud-Init creates that user.

Then it pulls down the SSH keys that you've configured in the keypair configuration into the home directory for that user. The default configuration in the Ubuntu image for Ubuntu LTS is to call that user "ubuntu."

And that’s fine, I guess, if you like that. Most people want to specialize that to something that’s either a bit more generic and cross-operating system, if they’re running more than one. Or it’s the company name, or they might want to create individual users.

One of the things we can do is override that default using this bit of config here, and because it’s YAML, you just let the IDE indent for you and hope that it gets it right, because it’s completely impossible to figure out what’s a list and what’s a map and what’s anything else. But I tested all these and this one does work.

What we're going to do is, instead of creating a user called "ubuntu," create a user called "ops," and it's going to be in a group called "ops." It's going to use Bash as its shell. It's not going to be password-accessible. It's going to have passwordless sudo and a GECOS field.

I never knew what that stood for until I was writing this slide, and I went and looked it up. It stands for the General Electric Combined Operating System, and it covers things like what room the user sits in the General Electric building, so that’s obviously useful to have on every cloud instance.
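A #cloud-config users block along these lines reproduces what's being described; the field values are my reconstruction, not copied from the slide:

    #cloud-config
    users:
      - name: ops
        primary_group: ops
        shell: /bin/bash
        lock_passwd: true                 # no password login
        sudo: ALL=(ALL) NOPASSWD:ALL      # passwordless sudo
        gecos: Operations user

Depending on the Cloud-Init version, you may also need to attach the instance keypair to this user explicitly (for example with ssh_authorized_keys) for SSH access to work as described.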

Once we've written a YAML file, we can provide it in basically the same way. Now, instead of providing a shell script here, we're providing the YAML file. Then once we run our instance, we can SSH in. We still get the annoying host key prompt. Let's try root instead.

If we SSH in as root instead, we can see that appears good. We have an information leak because it’s told us what the default username is, which is probably not a great default, but we can see that the default user looks to have changed to "ops." If we SSH in as ops, then get the directory entry for it, we can see that the information that we set is all there. So that appears to have worked.

#cloud-config schema

It’s a YAML structure, so what’s valid is not even defined. It’s a combination of the version of Cloud-Init you’re on and the modules that have been installed, and you can install custom modules if you want. It’s not very discoverable. The documentation is somewhat hit or miss.

This is an example of English understatement. The documentation is not hit or miss. It’s bad. Most of the information is kind of in the docs, if you read it right and have interpreted it a bit funny. All of the information’s in the code, and, generally, if you’re trying to write this and get all of the functionality, the only way to discover it is to go read the Python code.

There are usually multiple different ways of achieving any particular task that you want, a feature we’ll see in a bit. The problem with this is that writing the config files is going to be an iterative process. It’s very rare to write one and have it work the first time. Ultimately, what people end up doing is having this little library of snippets that do particular things that they’re going to want on a regular basis and then cargo-cult that throughout all of their projects, at any company they work at, and that kind of thing.

Cloud-Init does have some schema validation, though. We mentioned the init subcommand earlier. There's another subcommand called schema. We can take a YAML file here and run cloud-init devel schema with the config file. It tells us we've got valid #cloud-config. If we have, say, an extra quote on line 3, it will tell us we don't have valid #cloud-config.
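On the cloud-init versions current at the time of this talk, that check lives under the devel subcommand; newer releases promote it to a top-level subcommand. Roughly:

    # Older releases
    cloud-init devel schema --config-file cloud-config.yaml

    # Newer releases
    cloud-init schema --config-file cloud-config.yaml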

Unfortunately, it doesn’t go beyond being a YAML validator. So when we have this clearly valid config, it tells us it’s valid despite being clearly garbage. It’s not that useful as a tool to validate what you’re doing. It’s just a YAML validator, and every editor has one of those anyway.

It took me a long time to get Cloud-Init installed on macOS to be able to run this, because you have to basically compile it from source, because no package manager has it. Because who’s doing Cloud-Init on Macs? It was totally not worthwhile, it turns out.

Installing packages at startup

Let's look at one of the other things that you might commonly want to do on startup: install some packages. We can write some YAML for that, and we can tell it 3 things. We're going to install Docker, and Docker isn't in the Ubuntu 18.04 package repos, at least not under a name that I can find. So we're going to install it from Docker's own repository; they ship a Debian repository with Docker Community Edition as a package that you can install. But we need to configure that.

What you'd normally do there is have a Bash script write out a list file to /etc/apt/sources.list.d, or something like that, and then apt update, and then apt install the thing you want. It's error-prone, especially if you want to verify the keys as well. APT is a bit better at this, but yum is terrible at it, and it will tell you it's validated the keys when actually it hasn't. So unless you do it on the third Thursday of some year that has a full moon in it as well, it will not validate your checksums, and it will tell you it has.

But we can configure the source using the apt sources section of the config. Then we can tell it that on boot it's going to install Docker CE. We'll do that after it's updated the package sources, so that it has access to everything we've configured. Then finally, we need to restart the machine before Docker will come up. I'm not sure if that's even true anymore, but it was true at one point, so I'm going to do it anyway.

We can tell it: After it’s done everything else, before it runs the user scripts, it’s going to reboot the machine. Then, rather than running all of the user scripts, which can be supplied alongside this at first boot like we saw earlier, what we’re configuring it to do here is to run them on second boot instead, so that we have an opportunity to preconfigure some things.

At that point, if your script needs to use Docker or whatever other package you’ve installed, it’s going to be available and ready to go by then.
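The slide isn't in the transcript, but reconstructed from the description, the #cloud-config for the apt and reboot part would look roughly like this. The repository line and GPG key ID are Docker's published values as best I can tell, so verify them rather than trusting a transcript:

    #cloud-config
    apt:
      sources:
        docker.list:
          # $RELEASE is substituted with the distro codename (e.g. bionic)
          source: "deb [arch=amd64] https://download.docker.com/linux/ubuntu $RELEASE stable"
          keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88
    package_update: true
    packages:
      - docker-ce
    power_state:
      mode: reboot
      message: Rebooting after installing Docker CE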

If we run an instance with that config, we SSH into it after it's rebooted, so it takes a little while, because it's going to boot twice. Status on Docker: We can see it's running, which is great because it's not installed in the image normally. If we run the Docker CLI, we'll see it's there on the path for users as well, which is also good.

You can do this for a bunch of things. One of the common use cases is if your CI pipeline emits Debian packages for your software or yum packages or whatever, you can install them as your machine comes up by adding an S3 bucket as a repository and then using the HTTPS plugin to make sure that you’re not man-in-the-middled on the way down.

Verifying communications

Let’s take a look at a slightly more complex example, and this is going to end up demonstrating the way that there are lots of ways of doing anything and no clear best way to do any of them. Let’s come back to this message: "The authenticity of host x can’t be established." When we SSH into the machine, we blindly accept it on faith that we’re talking to the machine that we asked to talk to, and we’re not talking to MI5 or the NSA or something like that in the middle.

We can verify in the future that we’re talking to the same machine we were talking to, but we can’t really verify we’re talking to the correct machine. If you’re doing something like spinning up a bastion host that’s going to access, say, your database, you probably want to know that you’re talking to the correct machine. Now, there are other ways to achieve this as well. But one thing we could do is set our own host keys.

If we say yes to this message, then we see this scary thing, "Permanently added this IP to the list of known hosts." That, in the cloud world, is useless because machines spin up and down like crazy. I emptied mine the other day to test that this worked, and I had 140,000 entries in my known_hosts file over not that long, like 2 years maybe, something like that. So this can build up pretty heavily.

I also worked on a cloud provisioning plane for a long time. That's why I have so many; it's not that common. But 1,000 is not uncommon, and we're effectively accepting these at face value, which is not ideal.

One thing we could do on Amazon is go to the instance console output once we've booted, which prints out the SSH host key fingerprints, so we could compare those to the hash SSH gives us, and that would be good.

We can also get this via an API if we wanted. We could probably parse it, work out what the host keys are, and make the entries automatically. That's fine as an approach, but instead what we're going to do is set our own host keys, because the other way would be too easy, and it doesn't really involve Cloud-Init. Although Cloud-Init does print this.

Actually, if you look through the system log, you can see as Cloud-Init runs. This is really useful for debugging things that affect your access, because if you’re doing things with the SSH server, for example, you quite frequently lock yourself out of the machine.

The other downside of doing this is that, in Amazon, it takes a couple of minutes after the machine has booted for this log to appear. So you end up with pretty big cycle time on it.

Let's look at setting our own SSH host keys. There is a built-in way of doing this, which I found after I did it the hard way. So we'll do it the hard way first, and then we'll look at the easy way.

We need to build it ourselves. Breaking this down, we need to do a couple of different things. We need to generate some known host keys and get them onto the virtual machine. We need to move the keys into /etc/ssh before the SSH server starts; otherwise, it will complain that it doesn't have any, or worse, go and generate some of its own. To do that, we need to know where in the Cloud-Init process we can hook in to get all this logic in place.

Cloud-Init runs in 3 phases as you boot a machine. There’s an init phase, which is before the SSH server comes up. There’s a config phase, which is, in theory, supposed to be stuff that doesn’t affect boot—but actually everything affects boot, so the config phase is useless, and actually there are only 2 phases. The final one is configuration that you want to run after everything else is run. This is user scripts that might want to use installed packages or installed configurations from the init and config stages.

The configuration for what runs in each phase, at least on Ubuntu, lives in /etc/cloud/cloud.cfg, and it's obviously YAML, because why wouldn't it be? One of the pieces of configuration in there is which module runs in which phase. If we go and look at this, we can see the modules that are going to run in the init phase. One of them is a module called write-files, which looks useful, since one of the tasks we identified was writing some files into /etc/ssh.

If we go and look at the docs for that module, it writes arbitrary content to files, optionally setting permissions, which is good because we need to do that too. This is the doc format. It's all generated from the Python source code, so it's all Sphinx with a slightly nicer stylesheet than Sphinx normally has.

Then finally there's this SSH module down at the bottom, which configures the SSH daemon before it starts. If write-files runs before that, we should be good with the desired ordering. We can also verify that Cloud-Init runs before SSH: if you look at line 12, you can see it's ordered before SSH starts. So we're good to do that there.

About write-files

Some more about write-files. The files need to be provided embedded in the YAML config, which is annoying. If we wanted to provide them from a remote source, we could download them from S3 or somewhere like that and verify checksums ourselves. For each file, we can set all of these different things. Here's an example. You can see the top one has some Base64-encoded content, and it's going to write it to /etc/sysconfig/selinux, with root able to write and everybody else able to read. The others are all configured in various different ways.
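The slide isn't reproduced here, but a write_files entry matching that description looks something like this; the Base64 payload below just decodes to "SELINUX=permissive", purely for illustration:

    #cloud-config
    write_files:
      - path: /etc/sysconfig/selinux
        encoding: b64
        content: U0VMSU5VWD1wZXJtaXNzaXZlCg==   # decodes to "SELINUX=permissive"
        owner: root:root
        permissions: '0644'                     # root can write, everybody else can read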

Generating keys using Terraform

We need to generate the keys. One way we could do this is using the ssh-keygen program that’s built into most Unix-likes. What if you’re on Windows, though? Does anybody deploy from Windows? Wow. No one. OK, in that case, you can just do this.

But Windows generally does not have ssh-keygen. There are a bunch of tools; all of them are awful.

Instead, what we can do is use Terraform for this. I used to work on Terraform, and before I dig into this, I’d like to take a minute to congratulate the Terraform team on 0.12, which is a fantastic release and the most enormous lift I’ve ever seen in an open-source project while retaining compatibility. Great work for everybody involved.

This would not have been possible in 0.11 in such an easy way. This is all 0.12 config.

We can generate some SSH keys using the TLS provider. It turns out a private key is a private key regardless of what format you put it in. We can generate one for RSA and one for ECDSA, which are the 2 commonly used key types, using the private key resource. Then one of the attributes on that resource is the OpenSSH-formatted public key. This is a slightly more portable way of generating these keys.
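In Terraform 0.12 syntax, that part looks roughly like this (the resource names are mine):

    resource "tls_private_key" "rsa" {
      algorithm = "RSA"
      rsa_bits  = 4096
    }

    resource "tls_private_key" "ecdsa" {
      algorithm   = "ECDSA"
      ecdsa_curve = "P256"
    }

    # public_key_openssh is the OpenSSH-formatted public key mentioned above.
    output "rsa_host_key" {
      value = tls_private_key.rsa.public_key_openssh
    }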

If we apply that, we have the keys formatted as we expect at the bottom. The next thing we need to do is get them into the write-files configuration. We have Terraform up here, and we have a YAML template that's going to set the filename, the path, and the permissions for the public and private key of each host key type. We need to somehow get the private key bits into some structure that we can provide to this template so that the bits end up in the right place.

We can do that using this, which is pretty nice. For each private key, we can provide a keys object in the keys parameter and pull out the public key, the private key, and the algorithm name, and that will give us everything that we need to do this. If we apply that, then we get a write-files block in the correct format, which is nice. We can use that directly as part of the user data when we provision a machine.
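The slide code isn't in the transcript, but building on the sketch above, 0.12's templatefile function and a for expression can shape the keys into whatever structure the template expects; the template path and variable names here are guesses:

    locals {
      host_keys = {
        rsa   = tls_private_key.rsa
        ecdsa = tls_private_key.ecdsa
      }

      # Render a #cloud-config document containing a write_files section,
      # one public/private pair per key type.
      write_files_part = templatefile("${path.module}/write-files.yaml.tpl", {
        keys = {
          for algo, key in local.host_keys :
          algo => {
            public  = key.public_key_openssh
            private = key.private_key_pem
          }
        }
      })
    }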

We need some additional configuration, though. By default, you write the keys, and then Cloud-Init deletes them again—because why wouldn't it do that? Then, once it's deleted them all, it generates new ones, so it just looks like it's not working at all. If you set ssh_deletekeys to false, then it won't do that, and furthermore won't try to delete them in the first place.

But now we've got 3 different things that we want to get up there: the user config, the additional config for Cloud-Init and the SSH initialization, and our write-files section. We need some way of providing more than one file as user data. The way Cloud-Init deals with this is somewhat baroque but works: multi-part, effectively MIME-encoded, email-style documents.

I wrote a post about this the last time I went looking for this information, and if you look in the address bar, that was 2015. It basically hasn’t changed since then, and this is still the only documentation about how it works. I found myself reading it the other day thinking, "Who wrote this?" And then realized that it was me.

Terraform can generate this stuff. It can generate the multi-part bits, because you don't want to do that yourself. There's a data source called "Cloud-Init config," and you give it each part of the config you want along with its content type. If it's #cloud-config, you give it that; if it's a shell script, you call it a text shell script or something equivalent. We can render our write-files config directly into a part, and the data source has a property called "rendered" once it runs.

We can provide that as the user data to an AWS instance, and then we can do 2 more things. We can provide the public IP of the machine, and we can format the known host entries that we need to put into our file, so that we can connect into the instance.
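Continuing the same sketch: this uses the cloudinit provider's cloudinit_config data source (the talk may well have used the older template_cloudinit_config, which takes the same arguments), and the AMI and key name variables and file names are mine:

    data "cloudinit_config" "user_data" {
      gzip          = false
      base64_encode = false

      part {
        content_type = "text/cloud-config"
        content      = file("${path.module}/users.yaml")
      }

      part {
        content_type = "text/cloud-config"
        content      = local.write_files_part
      }
    }

    resource "aws_instance" "bastion" {
      ami           = var.ami_id
      instance_type = "t3.micro"
      key_name      = var.key_name
      user_data     = data.cloudinit_config.user_data.rendered
    }

    # known_hosts lines we can trust before ever connecting.
    output "known_hosts" {
      value = [
        "${aws_instance.bastion.public_ip} ${chomp(tls_private_key.rsa.public_key_openssh)}",
        "${aws_instance.bastion.public_ip} ${chomp(tls_private_key.ecdsa.public_key_openssh)}",
      ]
    }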

Let's run this. If we apply, these are our outputs: we have the public IP of the machine, and then we have 2 entries that we could go and put into our known_hosts file, truncated slightly here to make them fit.

Then we can try to SSH into the public IP, and we still get told the fingerprint can't be trusted, so it doesn't look like we're any better off. But if we take those outputs (the newlines get stripped off the end of the keys), write them into a known_hosts file, and then try to SSH in again, we trust that key without ever having accepted it blindly from the server. So that's good: we've managed to go from not being able to trust the machine we were talking to, to being able to trust it, with this config.

A more portable approach

About those docs: When people talk about SSH keys, they mean user keys, right? No. The hint is "public and private key," and obviously, for a user, you wouldn't hand over the private half. It turns out there is a built-in module for this, and instead of the write-files template we rendered earlier, we can just render a template directly that sets the key type and the private and public parts under the SSH keys setting, and it will do all of that for us.

We don’t need the extra config, and we don’t need to write the files directly ourselves and know about the paths. This is a bit more portable, because not every operating system keeps its host keys in /etc/ssh.
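That built-in option is the ssh_keys setting handled by Cloud-Init's SSH module; the rendered template ends up looking roughly like this, with the key material elided:

    #cloud-config
    ssh_keys:
      rsa_private: |
        -----BEGIN RSA PRIVATE KEY-----
        <private key material>
        -----END RSA PRIVATE KEY-----
      rsa_public: ssh-rsa AAAAB3Nz... host
      ecdsa_private: |
        -----BEGIN EC PRIVATE KEY-----
        <private key material>
        -----END EC PRIVATE KEY-----
      ecdsa_public: ecdsa-sha2-nistp256 AAAAE2Vj... host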

If we run this, we basically do the same thing. We only need one Cloud-Init part other than the template now. So we can run it, and we get the same result as before, where we can SSH in once we've added the known_hosts entries.

Debugging

Let's take a minute to talk about debugging. This is the debugging experience. There is a logfile (/var/log/cloud-init.log on most distributions) that records literally everything it does. This is line 600, and we're not even at the SSH service yet. It is detailed, and it's exhaustive, and if you need to find out what it's done, down to the number of bytes it's written for things, then this is the place to go look, and that's basically it. So the debugging experience is not great, but it's not nothing, which is nice.

Some other use cases

Here’s a bunch of other stuff that you could go and do, and sometimes instances will do bits of this for you.

If you attach a volume to an instance or a virtual machine in another cloud, the operating system doesn't inherently know what to do with it. It isn't by luck that it ends up with a file system, or that the file system is sized to fit the disk. You can change the size of the root volume as well, and that resizing clearly isn't baked into the image.

Cloud-Init is responsible for working out what disks are attached from the cloud metadata, creating a file system of an appropriate size on each volume, and then mounting them however you configure. You can configure a bunch of that stuff; you can even configure down to the partition layout on each drive, which is useful.
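This isn't from the talk, but as an illustration of the kind of config those modules accept (the device name is an assumption and varies by cloud and instance type):

    #cloud-config
    disk_setup:
      /dev/xvdb:
        table_type: gpt
        layout: true            # a single partition covering the whole disk
        overwrite: false
    fs_setup:
      - device: /dev/xvdb1      # the partition disk_setup created
        filesystem: ext4
        label: data
    mounts:
      - [ /dev/xvdb1, /data, ext4, "defaults,nofail", "0", "2" ]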

We looked at installing Docker. Another thing you can do is pull down Chef configuration and run it in solo mode when you boot the machine, which is quite a nice alternative way of writing the config, if you prefer that.

If you use SSM Parameter Store in Amazon, the tree functionality is 95% done, and there's no way of specifying where in the tree a machine should look. Maybe you could specify that through cloud config and write it to a file. Then, whenever you read from the SSM Parameter Store, you can prepend that to the path, and you're looking in the right section of the tree for the environment that you're in. I've also used it in the past to join nodes to Serf clusters. There's a whole bunch of stuff that you could do.

Wrapping up

In summary, there's a huge amount of functionality available that very few people will ever dig into beyond sticking a shell script in user data. It's not very discoverable, which is a shame, because there are a lot of good bits in there. If you want to discover the entire range of options, the only thing you can do is go read the source, because the docs don't contain it.

In 99% of cases, you can just go copy and paste the code from somewhere else like Stack Overflow or something like that, and it kind of works. Just make sure you’re not opening up random ports without any security on them or anything like that.

But it's generic enough and common enough that it's worth knowing at least the basics. So if you don't want an image-based workflow, or you're just doing something quickly, then maybe this is a good alternative. It's also worth learning if you've worked with someone who liked Cloud-Init and you're now responsible for maintaining whatever mess they made. That's another motivating reason.

Thanks for listening.
