Case Study

Running Windows Microservices on Nomad at Jet.com

Working with Windows containers was dicey a few years ago, but Jet.com was able to find some creative solutions for running Windows containers on Nomad.

The world is rapidly moving towards containerizing everything, mostly by using Linux cgroups and namespaces. But, Windows has kind of been a second-class citizen in this ecosystem.

While great progress has been made in the Windows container space, it has only recently become generally available. In 2016, when Jet.com (Walmart) started its migration to Nomad as its microservice scheduler, Windows containers were still in their infancy and the only Nomad driver able to run .NET was raw_exec.

In short, running and scheduling .NET Framework microservices is not easy. In this talk, you will learn:

1. More about Windows containers and their limitations
2. How Jet constrains Windows raw executables running on Nomad without container support
3. More about the Win32 API and how you can hook into it in your Go programs without resorting to cgo

Transcript

Thank you, everybody, for coming.

How many of you are Nomad users already? All right, a fair few of you. Of those, how many run Docker with Nomad? All right, pretty much the same group. Of those, how many of you run Windows Docker? About 5 maybe? Considerably less.

Also, are any of you familiar with the Go language? Do you have some experience with it? All right, I think I probably have something for everybody in this talk.

My name is Justen Walker. I'm a software engineer on the microservice platform team at Jet.com. Our team develops the platform on which Jet's microservices run. We do this by combining almost the entire Hashistack. We use Docker. We use Consul, Vault, Nomad, and we use a lot of Go to glue it all together.

I'm going to first cover containment on Windows and what your options are in order to contain your processes. Next, I'll describe, in as much detail as I can, what job objects are and how they're relevant to containing your processes on Windows.

Then we'll introduce an open-source project that we wrote called Damon that makes using job objects more practical on Nomad. And finally, if you're still awake and we still have time, we're going to dive into using the Windows API from Go without resorting to calling out to cgo. That's going to involve a lot of weird-looking Go language stuff, a lot of very low-level things.

I hope you're as excited about it as I am. I clearly spent too much time reading Microsoft Developer Network documentation and I want to share my pain with you.

Containment on Windows

What is a container? What does Docker actually do? Well, I found this diagram on some random website, and it says that containers provide basically 4 capabilities.

The first is control groups, the ability to partition resources like compute and memory, and to constrain them so that the processes don't monopolize the entire machine's resources.

The next thing they provide is namespaces. Namespaces are a way to give a container an isolated view of the machine, so that its processes can't see other containers' processes. It can't write to other containers' disks. It has its own partitioned network stack.

Third, it has layer capabilities. This is basically your UnionFS filesystem, the way that you get immutable images and you can write on top of them.

And finally, other OS functionality.

We can boil this down to 2 things: Containers provide resource constraints, preventing a process from consuming all the resources, and isolation, a way to make each container or each process view its world in isolation without being able to interact with the other processes that are being run.

What are your containment options on Windows? On Linux, it's basically just LXC via Docker or runC, but on Windows it's not quite that. It may be no surprise to you that Windows does have containers. In fact, Microsoft likes containers so much that they made them twice.

The first variant is called "process containers." These are containers that operate very much in the same way that Linux containers do. They share the kernel with the underlying host OS. And they provide all of the capabilities that you've come to expect from the diagram in the previous slide.

The next kind are Hyper-V containers. Hyper-V containers are an even more isolated form of containment. These, unsurprisingly, use Hyper-V to provide that isolation.

Both of these are accomplished by interoperating with the Docker library, the Docker binary, and the Docker protocols in order to make it easy for you to interact with these containers in the same way that you already interact with your Linux containers. They use the same protocols, the same tools that you're used to, and you can just get right up and running.

A lot of what I'm going to be covering in the next couple of slides was covered in more detail by John Starks and Taylor Brown at DockerCon 2016. If you want to deep-dive into exactly how they implemented Windows containment, current as of 2016, obviously, I recommend that you view this talk. It's really informative. But I will try to do my best to summarize.

Windows Server containers

The first kind is Windows Server containers. But Windows is a lot different than Linux. It's always been designed as a highly integrated system. Windows does not have a stable, documented syscall interface the way Linux does. All of your interactions with the Windows API are done through DLLs. They're done through APIs and services that the OS provides you.

The internal workings of those DLLs, they're not really documented. They're tightly coupled with the OS services that are running on the machine.

What does this mean for process containers? It means that while you can technically share the kernel with your process, you have to take along this baggage with you. You have to take along at least some amount of system processes and services and DLLs in order to make your process runnable.

It doesn't really matter what language you use. Something eventually has to make a Windows API call, and that will require some service to be available to make that call. This coupling is pretty inescapable, and it has some pretty critical impacts on the portability of Windows Server containers.

One of the first portability concerns is that Windows Server containers are not portable between builds of Windows. Practically speaking, this means that if you want to run a Windows Server container on 2016, you have to have built that container from a 2016 base image.

If you run that on, let's say, 1709 or 1803, it's just going to not work. In fact, Docker is going to stop it from running. This is simply not a limitation you have in Linux. You can just make whatever container you want in Linux and run it on any kernel version that you want. This is a pretty stark deviation from the portability of containers.

Secondly, Windows containers are by and large much larger on average than Linux containers. There is no from-scratch option. As I said, you have to take these DLLs, you have to take these services along for the ride. This coupling is inescapable.

However, Microsoft and the Windows team, to their credit, are doing a lot of work to try to pare that down to as small of a footprint as possible. But, of course, in the pare-down, there are some trade-offs.

For example, in the Windows Nano Server images, you don't get the .NET Framework with your image. You get a stripped-down version of PowerShell based on PowerShell Core. If you rely on the .NET Framework and not .NET Core, this could be a problem, and you will have to go with the more fully fledged image.

Hyper-V containers

If you're just putting a binary, like a Windows binary, in a container, maybe that doesn't really matter to you. But if the portability concern of "the builds must match" matters to you, then Microsoft created another variant of container called Hyper-V containers.

To use this isolation mode, you just run with --isolation=hyperv. This spins up a VM for each container that you run. You can see from this diagram that now we're no longer sharing the kernel. Now you get your own VM for your container. This is true isolation.

This makes containers portable between builds of Windows, but it comes at a pretty big cost. Hyper-V containers run within their own virtual machine, and that sounds crazy, right? Didn't we just go to Docker because we didn't want to run our services inside of VMs? We wanted it to be lightweight, so this is crazy.

But Microsoft did a ton of work to make the VMs as small as possible and start up as fast as possible. In fact, what Microsoft does is they spin up a VM that they take a memory snapshot of. And, then, all of your containers clone off that snapshot. Basically, none of your containers ever have to boot. They just start and run your process.

It's a little bit slower than process container isolation, but it's manageable.

One other caveat is that, if you're on a cloud environment, this requires nested virtualization. You're already in a VM if you're on Azure or Google Cloud or Amazon. So if you want to use Hyper-V containers, you already have a hypervisor; you need to have another hypervisor on your hypervisor.

But for all of those caveats, it just kind of works. This is just how you run a Docker container on Windows. It's no different from Linux. You just put driver = "docker" in your config. You point it, obviously, at a Windows image, not a Linux image. And then it just kind of works. It's because they have a common API and they use the same protocol to interact.

Nomad has the same fingerprinting drivers. Everything just kind of works.

But, I hear you: Do I need all that overhead? Maybe all I really care about is resource constraints. I don't really care about isolation. I have a small team. We all get along. I don't need my own VM for each of my processes.

Do I want VM inception? I think it might be a bit slower if I have a VM in a VM. If I test that out and that proves to be true, maybe I don't want that. And if I don't want that, and I go to Windows process containers, then I have to rebuild my image every single time I upgrade Windows to the next version. So that's crazy.

Maybe there's some way we can still get some of the benefits of the container ecosystem without using Docker for Windows.

One possible answer: Job objects

Job objects allow groups of processes to be managed as a unit. They enforce limits such as working set size, which is just memory, and process priority, and they let you terminate all the processes associated with a job.

Job objects are nameable, securable, shareable resources that control the attributes of processes that are associated with them. Job objects are basically cgroups for Windows. You can enforce your memory constraints here, and you can enforce your CPU constraints here. And you can do quite a bit of stuff with just this.

But job objects alone can't provide isolation. That comes in the form of this other concept called "namespaces" or "silos." That's a complex and slightly under-documented part of the Windows API. I won't be covering it in this talk.

How do you use job objects? You can only really do it through C or bindings to C. There are .NET bindings, so technically, you can also use PowerShell, but I've never found a good example, so I wouldn't recommend it.

What you do is call CreateJobObject. You get a handle to a job. You then set up your job object with the constraints that you require. You create the process in suspended mode, because you can't create a process in a job in a single atomic step. You have to create a process and then attach it. You don't want your process to run ahead of you. That's why you create it in suspended mode.

You get a handle to a process and a handle to the main thread. You then assign that process to the job object and finally resume the main thread. And you are running your process in a container of sorts.
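The order of those steps can be sketched in Go. This is a minimal illustration, not Damon's code: the jobAPI interface, the recorder fake, and the runInJob helper are hypothetical names invented here so the sequence runs on any OS. Only the Win32 function names in the comments are real.

```go
package main

import "fmt"

// handle stands in for a Win32 HANDLE in this sketch.
type handle uintptr

// jobAPI is a hypothetical interface. Each method is annotated with the
// Win32 function a real Windows implementation would call.
type jobAPI interface {
	CreateJob(name string) (handle, error)                          // CreateJobObjectW
	SetMemoryLimit(job handle, bytes uint64) error                  // SetInformationJobObject
	StartSuspended(cmdline string) (proc, thread handle, err error) // CreateProcessW with CREATE_SUSPENDED
	Assign(job, proc handle) error                                  // AssignProcessToJobObject
	Resume(thread handle) error                                     // ResumeThread
}

// runInJob performs the steps in the order described above.
func runInJob(api jobAPI, name, cmdline string, memBytes uint64) error {
	job, err := api.CreateJob(name)
	if err != nil {
		return fmt.Errorf("create job: %w", err)
	}
	if err := api.SetMemoryLimit(job, memBytes); err != nil {
		return fmt.Errorf("set limits: %w", err)
	}
	// Start suspended so the process cannot run before it joins the job.
	proc, thread, err := api.StartSuspended(cmdline)
	if err != nil {
		return fmt.Errorf("create process: %w", err)
	}
	if err := api.Assign(job, proc); err != nil {
		return fmt.Errorf("assign to job: %w", err)
	}
	return api.Resume(thread)
}

// recorder is a fake jobAPI that only records the call order.
type recorder struct{ calls []string }

func (r *recorder) CreateJob(string) (handle, error) {
	r.calls = append(r.calls, "CreateJob")
	return 1, nil
}

func (r *recorder) SetMemoryLimit(handle, uint64) error {
	r.calls = append(r.calls, "SetMemoryLimit")
	return nil
}

func (r *recorder) StartSuspended(string) (handle, handle, error) {
	r.calls = append(r.calls, "StartSuspended")
	return 2, 3, nil
}

func (r *recorder) Assign(handle, handle) error {
	r.calls = append(r.calls, "Assign")
	return nil
}

func (r *recorder) Resume(handle) error {
	r.calls = append(r.calls, "Resume")
	return nil
}

func main() {
	r := &recorder{}
	if err := runInJob(r, "demo-job", "my-service.exe", 256<<20); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(r.calls)
}
```

A real implementation would also need to close its handles and terminate the suspended process if the assignment fails. The sketch leaves that out to keep the ordering visible.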

The real answer is Damon

Job objects are what we need, right? But they're cumbersome to work with, and I don't think that learning these API calls ranks high on your to-do list. There just has to be a better way. And this better way, I believe, is Damon.

Damon is a standalone binary, written entirely in Golang, with no cgo dependencies. Damon does the opposite of Nomad. It's cute: "Damon" is "Nomad" spelled backwards. It's not named after Matt Damon.

Damon does all of the steps that I told you, in the previous slide, for you. The only thing you have to do is put the exe in front of your command line. The next slide shows an example of this.

But it's no substitute for containers. It's only going to give you some form of resource constraints. If isolation is important to you, this is not the tool. But if all you need is to prevent your processes from monopolizing your host, then stick around.

You run Damon by downloading it as an artifact. It would just be part of your Nomad task directory. Or you could put it on your server in a directory on the PATH or somewhere that Windows knows to look for it.

Then you put it as the command part of your config. This is using the raw_exec driver on Windows. All raw_exec does is rawly execute your command line. So it executes Damon.

Damon looks at the arguments list. And then Damon runs your program and does all that job object work for you. And you can set the constraints just like you're setting them in Nomad currently. You just set CPU and memory, and Damon will interpret those things through the environment variables that it gets from Nomad and set those constraints up in the job objects for you.
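As a rough illustration of reading those limits from the environment, here is a sketch in Go. It is not Damon's actual code. It assumes Nomad's NOMAD_MEMORY_LIMIT (in MB) and NOMAD_CPU_LIMIT (in MHz) task variables, and it assumes the caller already knows the machine's total MHz. The limits type and limitsFromEnv helper are hypothetical names. The CPU value is expressed the way a job object hard cap expects it, as cycles per 10,000 cycles.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// limits holds values in the units a job object expects (hypothetical type).
type limits struct {
	MemoryBytes uint64 // memory limit for the job, in bytes
	CPURate     uint32 // hard cap, in cycles per 10,000 cycles (1..10000)
}

// limitsFromEnv converts Nomad's task environment into job object units.
// totalMHz is the machine's total CPU capacity and is an input assumption.
func limitsFromEnv(getenv func(string) string, totalMHz uint64) (limits, error) {
	memMB, err := strconv.ParseUint(getenv("NOMAD_MEMORY_LIMIT"), 10, 64)
	if err != nil {
		return limits{}, fmt.Errorf("NOMAD_MEMORY_LIMIT: %w", err)
	}
	cpuMHz, err := strconv.ParseUint(getenv("NOMAD_CPU_LIMIT"), 10, 64)
	if err != nil {
		return limits{}, fmt.Errorf("NOMAD_CPU_LIMIT: %w", err)
	}
	if totalMHz == 0 {
		return limits{}, fmt.Errorf("total MHz must be non-zero")
	}
	// Scale the MHz share to the 1..10000 range and clamp it.
	rate := cpuMHz * 10000 / totalMHz
	if rate < 1 {
		rate = 1
	}
	if rate > 10000 {
		rate = 10000
	}
	return limits{MemoryBytes: memMB * 1024 * 1024, CPURate: uint32(rate)}, nil
}

func main() {
	l, err := limitsFromEnv(os.Getenv, 8000)
	if err != nil {
		fmt.Println("no usable limits:", err)
		return
	}
	fmt.Printf("memory=%d bytes cpu rate=%d/10000\n", l.MemoryBytes, l.CPURate)
}
```

Passing the environment lookup in as a function keeps the conversion testable without touching the real process environment.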

There's a ton of options that Damon supports, and they're documented in the README on GitHub if you want to know more.

Damon also does stats. If you do want to get some process statistics, you have to do a slight bit more work. You have to enable a port labeled "damon." This is configurable, but, by default, if you don't want to add more environment variables, just make a port named "damon."

You also have to expose this service somehow to Consul, probably, so that your Prometheus instance can scrape the instance and knows how to find it. It exposes CPU, memory, I/O usage, using the basic accounting APIs in the Windows API.

It doesn't use WMI because we found that querying WMI for lots of different processes on the same host usually resulted in some kind of deadlock, or livelock. So we decided against doing it that way.

How we made Damon

That's Damon. But if you're still interested, we have quite a bit of time. So my next couple of slides will be about how we did this, how we use Go to interact with the Win32 API and make Damon.

I'm going to go over how to create a job object in Go. I'm going to go over all the steps, using job object creation as the example, from loading a DLL to discovering how you need to call the Windows API signature, creating C strings and wide strings, because C strings are not like Go strings. You have to do a little bit of work to make them compatible.

You also have to create your own structures that are compatible with C. They're just regular Go structs, but you have to be careful about how you type them.

Finally, I'm going to show how to call the Windows API procedures with those arguments.

If you want to know more, I wrote a blog post on our Jet Tech blog that goes into a bit more detail, which is linked here.

Creating a job object in Go

Your first step is to load a DLL. Go has this thing called NewLazyDLL. It just loads a DLL that you can then create procedures out of.

Fortunately, Microsoft documents which DLLs are associated with the API call that you want to use. You just go on MSDN. You look up the API call you want to call and it will tell you kernel32.dll. Once you load that, you create your procedure. There are usually 2 variants for things that require string arguments, because string arguments in C are a little bit different than Go.

A long time ago, we didn't care about Unicode. Apparently, we didn't care about characters beyond the ASCII set. So we had an ANSI-type variant. That's what the A stands for at the very end: CreateJobObjectA. Basically, ASCII with some extra Microsoft characters tacked onto it.

But then we cared about Unicode at some point. So we decided to make wide characters. And that's the W variant there: CreateJobObjectW. Depending on which one you use, you have to make different types of strings. How do you do that? We'll get into that soon.

But first, we're going to look at the API call signatures. The CreateJobObject, I'm just going to take, for example, CreateJobObjectA, which takes ANSI strings. In the MSDN documentation, it requires a pointer to a job attribute structure and a string.

Lucky for us, the documentation says that both of these are optional, so you can pass null for these and be fine. However, let's just say we wanted to at least give the job a name. How do we do that? We can't just pass in a Go string to this thing.

There are the 2 variants of strings. The first variant is a C string. If you're familiar with the C language, strings are just pointers to contiguous blocks of memory that terminate with a null byte. We have to simulate that for C.

The way that we do that is we just make our string into a byte array. We tack on a null and we return the pointer to the first byte in that array.

You notice we're not returning the pointer to the slice, we're returning the pointer to the first element because that gives you the pointer to the backing array or the first element of the backing array. That's important.
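A minimal sketch of that conversion in Go follows. The names cBytes, cString, and goString are made up for this example. goString exists only to read the result back and show that the terminator is in place.

```go
package main

import (
	"fmt"
	"unsafe"
)

// cBytes copies s into a new byte slice with a trailing NUL byte.
func cBytes(s string) []byte {
	b := make([]byte, len(s)+1) // make zeroes the slice, so the last byte is NUL
	copy(b, s)
	return b
}

// cString returns a pointer to the first element of the backing array,
// not a pointer to the slice header.
func cString(s string) *byte {
	return &cBytes(s)[0]
}

// goString walks the bytes from p until it reaches the NUL terminator.
func goString(p *byte) string {
	var out []byte
	for ; *p != 0; p = (*byte)(unsafe.Add(unsafe.Pointer(p), 1)) {
		out = append(out, *p)
	}
	return string(out)
}

func main() {
	p := cString("my-job")
	fmt.Println(goString(p))
}
```

The standard library's syscall.BytePtrFromString does the same job and also rejects strings that contain an embedded NUL, so it is usually the safer choice in real code.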

For Unicode strings, it's not UTF-8; it's UTF-16. Every single character is at least 2 bytes long. You convert the string into runes, encode those as UTF-16, and tack on a double null (a 16-bit zero) at the very end. But the whole thing is pretty much the same thing. It's just a uint16 instead of a uint8.
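The wide-string version can be sketched the same way with the standard unicode/utf16 package. The helper names wideString and wideStringPtr are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// wideString encodes s as UTF-16 code units and appends a 16-bit NUL.
func wideString(s string) []uint16 {
	return append(utf16.Encode([]rune(s)), 0)
}

// wideStringPtr returns a pointer to the first code unit, which is the
// form a W-variant Windows API expects.
func wideStringPtr(s string) *uint16 {
	return &wideString(s)[0]
}

func main() {
	fmt.Println(wideString("job")) // three code units plus the terminator
	fmt.Println(*wideStringPtr("job") == 'j')
}
```

Characters outside the Basic Multilingual Plane become two code units (a surrogate pair), which is why "at least 2 bytes" is the accurate way to put it. On Windows, syscall.UTF16PtrFromString covers this case and also checks for embedded NULs.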

Creating C-compatible structures

Follow me to make an analogous Go struct that mirrors a C struct.

I put in 2 different structures, above and below, on this slide. You can see the first one is the C variant of this. This is what's found in the header file in the Windows API. We have to make an analogous structure in Go in order to work with it.

We can look up in this Windows data types reference what the sizes of those things are. And we can try to map those to the Go primitive types. The other thing that helps us out here is that all pointer types can be declared as uintptr in Go, so if you see a pointer, you can just say uintptr and be done with it. All else you have to know is whether a field is 32-bit or 64-bit, whether it's signed, etc.
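As an example of that mapping, here is a sketch that mirrors SECURITY_ATTRIBUTES, the attribute structure that CreateJobObject accepts. The choice of structure is an assumption made for illustration and may not be the one shown on the slide. The Go type name securityAttributes and the two helper functions are hypothetical.

```go
package main

import (
	"fmt"
	"unsafe"
)

// securityAttributes mirrors the C declaration:
//
//	typedef struct _SECURITY_ATTRIBUTES {
//	    DWORD  nLength;
//	    LPVOID lpSecurityDescriptor;
//	    BOOL   bInheritHandle;
//	} SECURITY_ATTRIBUTES;
type securityAttributes struct {
	Length             uint32  // DWORD: unsigned 32-bit
	SecurityDescriptor uintptr // LPVOID: pointer-sized
	InheritHandle      int32   // BOOL: signed 32-bit int
}

// sizeOfSecurityAttributes reports the Go struct size including padding.
func sizeOfSecurityAttributes() uintptr {
	return unsafe.Sizeof(securityAttributes{})
}

// pointerSize reports the size of a pointer on this platform.
func pointerSize() uintptr {
	return unsafe.Sizeof(uintptr(0))
}

func main() {
	sa := securityAttributes{Length: uint32(sizeOfSecurityAttributes())}
	fmt.Printf("pointer size %d, struct size %d, Length field %d\n",
		pointerSize(), sizeOfSecurityAttributes(), sa.Length)
}
```

Go inserts the same alignment padding a C compiler would for this layout, so on a 64-bit platform the struct occupies 24 bytes, and on a 32-bit platform 12. Checking unsafe.Sizeof against the size the C header implies is a cheap way to catch a mistyped field.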

How to call the Windows API procedures

Now let's go over calling the API from Go. You have to pass each argument into the call as a uintptr. Every argument is treated this way. This is just how the syscall library works in Go.

But unsafe.Pointer is special. This cast allows you to make any pointer into any other pointer, without any safety guarantees about whether that's valid. Since Go is garbage-collected, the runtime tracks ordinary Go pointers and is free to move memory around (goroutine stacks, for example, can be relocated as they grow).

When a pointer is converted to a uintptr, it becomes just a number. But that number may or may not represent the location in that process's memory that it once did.

Because of this, you have to call syscalls in a very peculiar way. You have to use this uintptr(unsafe.Pointer) chain in the argument list. And this signals to the Go compiler that, for the duration of the syscall, you can't change where this pointer points to. This guarantee allows all the syscalls to treat those pointers as if they were regular unmanaged memory.

The other ways in which you can use unsafe.Pointer are documented in the godocs. We are using an officially sanctioned pattern: the uintptr(unsafe.Pointer(...)) conversion inside the syscall's argument list. There are many other ways you can do this, which I'll go into in a little bit.

To recap, I showed you that Docker on Nomad pretty much just works, if you're willing to put up with some of the peculiar differences that the Windows container ecosystem imposes upon you when you run Windows containers.

We also covered, if you don't want to have that overhead imposed upon you, that job objects are sometimes an acceptable alternative to that, if you only need to have process restrictions and not isolation.

We then covered how you can use Damon in order to constrain those executables in Nomad without writing any code. And finally, we did a little bit of DIY Go Win32 API hacking.

Some peculiar Go code

And if I had no time, this would probably be where I ended my talk. But we have a bit more time. So I'd like to go into some even more peculiar Go code if you're willing to bear with me.

Sometimes you have to work with raw memory. And a lot of these calls basically do the same thing. You have to create a buffer that the Windows API call will fill with some actual memory or some actual structure.

In order to do this, you have to first call with a zero buffer. This just says there's nowhere to put my memory. It returns an error, but it returns a special type of error called ERROR_INSUFFICIENT_BUFFER. In addition to doing that, it changes the value of your buflen variable to the size that you should have passed it.

Now you can loop around and extend the buffer to that bigger size. Then you call it again. Unfortunately, it might still be wrong. You have to keep doing this over and over and over again. Usually it works, but sometimes it doesn't. It requires this repetition.

But eventually, you'll get a success or another error. And then you'll check for that error and do whatever you want. But once you have this buffer, it's filled with just raw bytes. How do you work with that? They're just bytes. How does Go know how to index into those fields?

You use that unsafe.Pointer that I mentioned before. You cast the pointer to the slice's backing array into a pointer to a Go structure that shares the same size. And this allows you, as the developer, to index into that raw byte buffer and treat those entries as just regular fields.

To illustrate this point a bit further, once we have a buffer populated by that API, we can get to the backing array by taking the pointer to the first element of that array.

And then we can use the unsafe.Pointer to cast that byte pointer to a structure that has the same size, and that didn't require any different allocations. Now we can access the bytes raw using the raw byte slice, or we could access those fields interpreted by the backing structure that we've defined.

You can see in the slide, we have a struct that has 2 fields. Those fields are 2 bytes wide. And now you can index field 1 and field 2, and they may have some different semantics. For instance, they might be uint16, for example.
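A small Go sketch of that cast follows, using a hypothetical pair struct with two uint16 fields and a hypothetical asPair helper. The byte values are chosen so the result does not depend on the machine's byte order.

```go
package main

import (
	"fmt"
	"unsafe"
)

// pair overlays two 16-bit fields on four raw bytes.
type pair struct {
	Field1 uint16
	Field2 uint16
}

// asPair reinterprets the start of buf as a *pair without copying.
// It returns nil if buf is too small to hold the structure.
func asPair(buf []byte) *pair {
	if len(buf) < int(unsafe.Sizeof(pair{})) {
		return nil
	}
	// Pointer to the first element, i.e. to the backing array.
	return (*pair)(unsafe.Pointer(&buf[0]))
}

func main() {
	buf := []byte{0x01, 0x01, 0x02, 0x02}
	p := asPair(buf)
	fmt.Printf("Field1=%#04x Field2=%#04x\n", p.Field1, p.Field2)

	// The struct and the slice share memory: a write through one
	// is visible through the other.
	p.Field2 = 0
	fmt.Println(buf)
}
```

Because no copy is made, the struct pointer is only valid for as long as the byte slice is kept alive and unmodified by anything else.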

There's another type of API call that's important to know: one that works with any-size arrays.

Sometimes you'll get API structures that have embedded data in them. And Go doesn't have a way to represent variable size arrays. I mean, it does, but slices don't translate directly into C arrays. You have to do a bit of work. Actually, you have to do quite a bit of work. Quite a lot.

What we have to do is get the pointer to the first byte like we do to get the backing array. And then cast it to the group result or cast it to the structure. But then we have to get the pointer to the embedded array.

With this embedded array, we're basically cheating. We're saying that this array is of size 1 because we need to be able to tell the Go compiler that this array has at least 1 element. Because we can't just say a 0-size array, we have to say at least 1.

That allows us to take the pointer to that array. And because we're using arrays and not slices, we don't have to get the pointer to the first element of the array. We just get the pointer to the array.

Then we cast that, using unsafe.Pointer, to a pointer of an array that's an immense size. This is just a temporary measure, because what we're going to do immediately afterwards is use the built-in Go slicing functionality to create a slice that is backed by that array pointer.

We're going to do that by putting the slice operator at the end, and set the length and the capacity equal to the structure's count or whatever it gave back.

For example, in this, we have a count element. We are using the slice operator to set the length and capacity. And it's important to set both to the actual count of the groups in that structure. Now that groups variable can be worked with like any other slice.

To illustrate this, we get back a group structure, and that group structure only covers the first preamble of that space in memory. So the buffer that we passed in is a bit larger than the group result that we got.

What we had to do is do a little bit of casting work to get the pointer to the group field. And use a little bit of unsafe.Pointer magic to make that into a pointer to an array of a large enough size so that the slice operator wouldn't yell at us for slicing an array that was too small. And that's basically it.
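Here is a sketch of that technique in Go. The group and groupResult types are hypothetical stand-ins for a Windows structure with a trailing any-size array, and fakeAPIBuffer fabricates the bytes an API might have written. None of these names come from a real library.

```go
package main

import (
	"fmt"
	"unsafe"
)

// group is one element of the trailing array.
type group struct {
	ID         uint32
	Attributes uint32
}

// groupResult declares the array with 1 element, as the C header would.
type groupResult struct {
	Count  uint32
	Groups [1]group
}

// groupsOf returns a slice over the Count elements that follow the header.
// It returns nil if buf is too small for the count it claims.
func groupsOf(buf []byte) []group {
	if len(buf) < int(unsafe.Sizeof(groupResult{})) {
		return nil
	}
	r := (*groupResult)(unsafe.Pointer(&buf[0]))
	n := int(r.Count)
	need := int(unsafe.Offsetof(r.Groups)) + n*int(unsafe.Sizeof(group{}))
	if n > 1<<20 || need > len(buf) {
		return nil
	}
	// Pretend the array is huge, then slice it down so that both the
	// length and the capacity equal the real element count.
	return (*[1 << 20]group)(unsafe.Pointer(&r.Groups))[:n:n]
}

// fakeAPIBuffer builds the raw bytes in native byte order.
func fakeAPIBuffer(ids ...uint32) []byte {
	words := make([]uint32, 1+2*len(ids))
	words[0] = uint32(len(ids))
	for i, id := range ids {
		words[1+2*i] = id // ID; Attributes stays zero
	}
	return unsafe.Slice((*byte)(unsafe.Pointer(&words[0])), len(words)*4)
}

func main() {
	for _, g := range groupsOf(fakeAPIBuffer(10, 20, 30)) {
		fmt.Println(g.ID)
	}
}
```

On Go 1.17 and later, unsafe.Slice((*group)(unsafe.Pointer(&r.Groups)), n) expresses the same thing without the oversized array type. The bounds check against len(buf) matters either way, because nothing else stops the slice from reaching past the buffer.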

Thank you for coming to my talk. Enjoy the rest of the conference.

Be sure to check out our GitHub, github.com/jet/damon, if you want to learn more about the project. Thank you very much.
