With the help of HashiCorp’s Terraform, Consul, and Nomad and a few in-house tools, the Dutch National Police wrangled the complexities of their microservices architecture.
Microservices development has been common practice for quite some time at the Cloud, Big Data and Internet department of the Dutch National Police. About a year ago they found areas where they could improve.
At the time, their microservices were running on VMs that were purpose-built for each particular microservice. There was ad-hoc automation around creating these VMs. Furthermore, each microservice was deployed on a single instance and thus not highly available (HA). They set out to change this: they wanted 24/7 availability, zero-downtime deployments, high scalability, and a more self-service infrastructure.
In this talk, Richard Kettelerij, a software architect at Mindloops, and Arno Broekhof, a full-stack data engineer at DataWorkz, share how they gradually migrated dozens of microservices to a highly available cloud-native container platform. The key here is 'gradual migration'. They wanted to move fast but avoid a big-bang change on both the technology and process side.
With the help of HashiCorp’s Terraform, Consul, and Nomad plus a few in-house developed components (in Go, Java, and TypeScript) they managed to address their concerns. This talk will provide you with an understanding of how these tools provide value when used on their own as well as when composed together.
Richard: So, the last session before the closing keynotes. We're gonna talk about how we're using a couple of HashiCorp tools at the Dutch National Police, and also why we're using them. Just a quick intro, I'm Richard. I work for a company called Mindloops. I'm also working on a project at the police, obviously, together with Arno.
Arno: Yeah. Hi everyone. My name is Arno. I work for a small company called DataWorkz, and together with Richard I'm working in the Cloud, Big Data and Internet product line at the Dutch National Police. A brief introduction: the product line is divided into five DevOps teams. Each team works on its own particular product, and each also has its own logo and name. Together with Richard, I'm a part of the Cobalt team, and we're working on various products that are used in the incident response rooms, like the one you see here in this picture.
Richard: If you call a national emergency line like 112, or 911 in the US, then you get connected to one of these rooms and they'll take your incident and help you further.
Richard: About a year ago, at the product line we already had a couple of nice things in place. There was a cloud environment and it's much like the public cloud, but it wasn't the public cloud environment. For obvious reasons it was a private cloud environment. It was a cloud environment based on OpenStack that was running out of the data centers of the national police. A pretty large scale environment, so it was already in place. Also, we were doing microservice development for quite some time, so we had, let's say, 30 microservices running a year back. Mostly Spring Boot applications. These Spring Boot applications were deployed to production at the end of each sprint, so we had a continuous delivery pipeline in place for automatic deployments, automatic testing, all that kind of stuff. Altogether, I would say a pretty modern engineering organization.
But there were also some things that could be improved. For instance, there was limited automation around creating new virtual machines. Each time we wanted to deploy something and it was a new application, we would roll out a new machine. It was stuff like clicking together the machine in the user interface of OpenStack. Then Puppet came along, provisioned the machine, all that kind of stuff. But it wasn't a whole integrated solution.
Also, we had no solution for high availability, so each of our microservices was a single instance and that wasn't great either. And finally, we had no zero-downtime deployments. So every time we released something, the whole system went down briefly, for a couple of seconds, then came back up again. We could definitely improve something there. To summarize, what we basically had was purpose-built single-instance virtual machines for each particular microservice. Each microservice was running on a particular VM for only that particular microservice.
Visually, it looked like this a year ago. In the middle you see all the microservices. All single instances, mostly Spring Boot applications, with Nginx sitting in front of each one. In front of that was an Angular application, a pretty modern Angular application; nowadays we're on Angular 6. And it was connected straight to the microservice virtual machines.
On the right side you see our datastores. These were already highly available, so we had a large Cassandra cluster running. We had several Elasticsearch clusters running. We had a MongoDB cluster running. This was already highly available. And our goal was to make the microservice tier also highly available.
So, our goal was to make microservices highly available and obviously, what we needed to do was to deploy more virtual machines, but we didn't want to go through the hassle of rolling them out manually. We wanted to automate this further, so we looked for a tool.
Arno: Yeah. So that tool for us is Terraform. I think it was a Friday morning and I came into the office, looked at the sprint board, and there it was: deploy 10 virtual machines for our new microservice. The whole team hated this because it was all clickety, clickety click in user interfaces. Like Clint said in his presentation, it was stage one.
And I asked one of my colleagues, "Do you know Terraform?" He said, "Yeah, I've read something about it, but I actually didn't use it." So we were like, yeah, let's give it a try. And two hours later, we deployed our first microservice VM with it. It was so simple and so easy to create that we started migrating all our other services to Terraform templates.
If you look at it after two weeks: we create a module, put in some variables, type in terraform apply, and it goes to the OpenStack API, creates the instance, then goes to PowerDNS to create the DNS records, and it also signs our certificates for Puppet.
So, what have we learned from using Terraform? First of all, everyone who has ever worked with Terraform has run into the directory layout. Please don't make the same mistake we did: start with a good directory layout and modularize everything from the beginning, or else you will hate yourself afterwards, because you need to move state files and all that kind of thing. Another one: Terraform doesn't literally have for loops and if statements (features like these arrive in version 0.12). From a developer's perspective, you can't write a for or an if. You can use the interpolation functions to create some kind of if statement, and use count to emulate a for loop from one to ten, or to switch a resource on and off with a count of zero or one. To summarize, the end result: we now had multiple VMs running the same service.
Richard: Like Arno said, we now have multiple VMs, so we're basically highly available, but now the question is, how do we make use of this new infrastructure? Ideally we want to load balance across these machines. So how do we address those?
Richard: Well, the obvious solution of course is to stick a load balancer in front of it. But in our case it wasn't that easy because we were using Kerberos for authentication. Kerberos has some nice features, but it's not very cloud friendly, especially our use of Kerberos. We did delegated authentication and such.
So we had to find a different solution and we turned to another tool, also from HashiCorp: Consul. We are using Consul specifically for service discovery, so we're not currently making extensive use of the key-value store features. I see service discovery as a large address book. Each of our microservices registers itself in that address book, and when we want the actual nodes that are associated with a microservice, we ask Consul and we get back the node information.
But it's not just any address book, it's a smart address book because Consul is able to do health checking. So when we ask Consul for a particular microservice, we only get back the healthy nodes and that's an ideal building block for building your own load balancing solution.
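As a rough illustration of that "smart address book", the sketch below filters a lookup result down to the nodes whose health checks all pass. The payload mimics the shape of Consul's /v1/health/service/<name> HTTP response, but the node names, addresses, and ports are invented for the example, and healthy_endpoints is a hypothetical helper, not something from the talk.

```python
import json

# Trimmed, illustrative payload in the shape returned by Consul's
# /v1/health/service/<name> endpoint. Hosts and ports are made up.
SAMPLE_RESPONSE = json.loads("""
[
  {"Node": {"Node": "node-1", "Address": "10.0.0.11"},
   "Service": {"Service": "service-bar", "Address": "", "Port": 8443},
   "Checks": [{"Status": "passing"}]},
  {"Node": {"Node": "node-2", "Address": "10.0.0.12"},
   "Service": {"Service": "service-bar", "Address": "", "Port": 8443},
   "Checks": [{"Status": "passing"}, {"Status": "critical"}]},
  {"Node": {"Node": "node-3", "Address": "10.0.0.13"},
   "Service": {"Service": "service-bar", "Address": "bar-3.example.internal", "Port": 9443},
   "Checks": [{"Status": "passing"}]}
]
""")


def healthy_endpoints(entries):
    """Return (host, port) pairs for entries whose checks all pass."""
    endpoints = []
    for entry in entries:
        if all(check["Status"] == "passing" for check in entry["Checks"]):
            # Consul leaves Service.Address empty when the service
            # simply uses the address of the node it runs on.
            host = entry["Service"]["Address"] or entry["Node"]["Address"]
            endpoints.append((host, entry["Service"]["Port"]))
    return endpoints


if __name__ == "__main__":
    print(healthy_endpoints(SAMPLE_RESPONSE))
```

In practice Consul can do this filtering itself when the query asks only for passing instances; the explicit loop is there to make the idea visible.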
So when we added Consul to the mix, it looked like this [8:42]. Now we have a Consul cluster running and we have a Consul agent running on each virtual machine. The Spring Boot application talks to the Consul agent, because talking to the local agent is the idiomatic way of using Consul, and in turn the Consul agent talks to the Consul cluster.
But we still needed a solution for load balancing. We have a directory with all the healthy node information, and now we need the load balancer. We decided to use client-side load balancing. For service-to-service communication, meaning communication between our Java-based microservices, we're using Ribbon, a library from Netflix. We're not using Ribbon directly; we're using it wrapped in a library called Spring Cloud Consul, which integrates nicely with our Spring Boot applications. Ribbon is basically a Java library with a pretty advanced load balancer built into it.
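To make the idea of client-side load balancing concrete, here is a minimal round-robin sketch. It is not Ribbon and not how Spring Cloud Consul works internally; RoundRobinBalancer and its method names are hypothetical, and Ribbon offers far more advanced policies than this.

```python
import threading


class RoundRobinBalancer:
    """Hands out endpoints in turn from a list the caller keeps fresh."""

    def __init__(self, endpoints):
        self._lock = threading.Lock()
        self._endpoints = list(endpoints)
        self._next = 0

    def update(self, endpoints):
        """Swap in a new list of healthy endpoints from the registry."""
        with self._lock:
            self._endpoints = list(endpoints)
            self._next = 0

    def choose(self):
        """Return the next endpoint, wrapping around at the end of the list."""
        with self._lock:
            if not self._endpoints:
                raise LookupError("no healthy endpoints available")
            endpoint = self._endpoints[self._next % len(self._endpoints)]
            self._next += 1
            return endpoint


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["host-a:8443", "host-b:8443"])
    for _ in range(3):
        print(balancer.choose())
```

The caller decides which instance to contact, so no central load balancer sits in the request path; the only shared piece is the registry that supplies the endpoint list.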
And that works pretty nicely for our client-side load balancing. But like I said, we also have browser-to-service communication, which can't go through a central load balancer. So we used the ideas that we learned from Evan, and basically used client-side load balancing for our browser-to-service communication. It's a little unconventional, but it works in our case and it was actually the only solution we could roll out given the constraints that we had.
So when we look at this [10:42]: you have a web server with Consul template and our SSE server running. The Angular application receives node information and it's able to direct traffic to one instance of service-bar, and on a different occasion it's able to direct it to a different host. So basically you have client-side load balancing all over the place: between microservices, but also between browser applications and microservices.
So how does it work in detail? Here [11:18] you see the Consul template configuration. Consul template is able to supervise another process, so in our case it supervises the tool that we wrote ourselves, the SSE server. Each time something in Consul changes, Consul template renders a new JSON file, and that JSON file is pushed out to the clients, which are able to handle it. We actually open sourced our SSE server, so if you want to take a look, it's on the police GitHub account.
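The push step can be pictured with the Server-Sent Events wire format, where each message is a few "field: value" lines followed by a blank line. The sketch below only shows how a rendered node list might be framed as one SSE message; format_sse, the event name, and the payload layout are assumptions for illustration and are not taken from the open-sourced SSE server.

```python
import json


def format_sse(data, event=None):
    """Encode one Server-Sent Events message carrying `data` as JSON."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps emits no raw newlines by default, so one data line suffices.
    lines.append(f"data: {json.dumps(data, sort_keys=True)}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    # Hypothetical payload: service name mapped to its healthy hosts.
    nodes = {"service-bar": ["host-a:8443", "host-b:8443"]}
    print(format_sse(nodes, event="services"), end="")
```

A browser application subscribed to such a stream would replace its local list of hosts whenever a new message arrives, and pick one of them per request.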
Richard: So, some lessons that we learned. You might think that we have just one big environment with everything in it, but actually we have multiple teams like Arno said and they all have their own tenant in our OpenStack environment. So they're all working on their own products and they all have their own tenants. And within each tenant they have different environments, like a testing environment, a dev environment, a production environment. And we use Consul also to direct traffic between tenants because in some cases we might want to use a microservice written by a different team.
So how did we solve it? We used the data center abstraction in Consul. You really need to think carefully about how you're going to use it, but we used the data center abstraction in Consul to separate the different environments and the different tenants. So basically we have the whole environment replicated in each environment and we use Consul to join them together and look up services between teams, actually between tenants.
Another thing is, like I said, we need to receive external requests, traffic from browser applications, on our microservices, and we don't have a central load balancer or API gateway. It wasn't possible given our constraints. That means we needed to register the public IP or public host name in our Consul directory so the browser could contact it. But of course the node itself isn't aware of its public host name, so we can't use the service registration in Spring Cloud Consul to register the service in Consul. We use Spring Cloud Consul mainly for service lookups, and service registration is handled by an external process. So that's also something to think about.
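A small sketch of what a cross-tenant lookup could look like when each tenant and environment is its own Consul datacenter. The "tenant-environment" naming convention and the helper names are assumptions made for this example; the DNS name layout and the dc query parameter are standard Consul features.

```python
from urllib.parse import quote, urlencode


def datacenter_name(tenant, environment):
    # Hypothetical convention: one Consul datacenter per tenant and environment.
    return f"{tenant}-{environment}"


def dns_name(service, datacenter, domain="consul"):
    """Consul DNS name for a service in a specific datacenter."""
    return f"{service}.service.{datacenter}.{domain}"


def health_query_path(service, datacenter):
    """HTTP API path asking for passing instances in another datacenter."""
    query = urlencode({"dc": datacenter, "passing": "true"})
    return f"/v1/health/service/{quote(service)}?{query}"


if __name__ == "__main__":
    dc = datacenter_name("cobalt", "prod")
    print(dns_name("service-bar", dc))
    print(health_query_path("service-bar", dc))
```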
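Registration by an external process might look roughly like the sketch below, which builds the JSON body for Consul's agent service registration endpoint (/v1/agent/service/register) using the public host name rather than the address the node knows about. The host name, port, service ID, and the /health path are placeholders, and registration_payload is a hypothetical helper, not the team's actual tooling.

```python
import json


def registration_payload(name, instance_id, public_host, port, health_path="/health"):
    """Build a service definition that advertises the public host name."""
    return {
        "ID": instance_id,
        "Name": name,
        # The address browsers must use, which the node cannot discover itself.
        "Address": public_host,
        "Port": port,
        "Check": {
            "HTTP": f"https://{public_host}:{port}{health_path}",
            "Interval": "10s",
        },
    }


if __name__ == "__main__":
    payload = registration_payload(
        "service-bar", "service-bar-1", "bar-1.example.org", 443
    )
    # An external process would PUT this body to the local Consul agent.
    print(json.dumps(payload, indent=2))
```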
So basically we have high availability taken care of. We have multiple nodes, multiple virtual machines, we have Consul with all the information about the nodes, and we have a load balancing solution in place. So, goal achieved. Everything's highly available.
But there were still two things missing. For instance, we had no zero-downtime deployments. Basically every time we released something, even though it was highly available, we took down the whole thing and started it back up again. And there were still single-purpose VMs, so each VM was only capable of running one particular microservice, and we wanted to abstract away from that. So again we looked for a solution.
Arno: We looked into many different tools like Kubernetes, Apache Mesos, Docker Swarm, but in the end we chose Nomad. So why did we choose Nomad?
It's simple. Download the binary, put in some configuration, and it runs. It just works. It also has first-class Consul integration, and it can handle multiple workloads: let's say a Java application, an exec, or it can use Docker. And we had a constraint due to Kerberos, so the multiple workloads were very, very important.
So how is the architecture now? As you can see [15:14], we replaced the nodes in the middle with generic nodes, and we replaced Nginx with a sidecar proxy that we built in Go specifically for our own needs. We call it Emissary. Side note: it's just another word for proxy, like Envoy is. It takes care of the Consul registration and it does SSL renewal with our on-premise certificate authority. This was all previously done by Puppet scripting.
So what have we learned from using Nomad? Every program that you run with Nomad using the Java driver or the exec driver runs in a chroot. In other words, it runs isolated. That means that if, for example, we deploy a service that needs to bind port 443 for HTTPS, it can't without elevated privileges, and then you're exposing system resources. Therefore we created a simple rerouting rule on every node: when traffic comes into the machine, it gets rerouted to a higher port. So every job in Nomad can run as the user "nobody".
Richard: Of course you have lots of microservices. We started off with 30 and now I think they have double that. And some of them are very similar, so we wanted to avoid duplicating a lot of Nomad configurations, so we wanted to use templating. Template our jobs.
So we first started off with Terraform, which we already used, for templating purposes, but now we have switched to another tool called Levant. I know James is probably here. Shout out to James.
It's a great tool for templating our Nomad jobs. We also contributed some code to support multiple files, so we can have a generic template for all our jobs and a specific template for each job. It works out nicely, so nowadays we have this simple configuration and it templates our Nomad jobs.
Richard: So in conclusion, why did we do this? Why did we start using all these tools? Well, our initial goal was to offer high availability, so we wanted everything to be highly available. That was the logical thing to do: our datastores were already highly available, just the services in front of them weren't, and that is not what you want. We wanted to have everything highly available.
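The layering of generic and job-specific settings can be sketched as follows. This uses Python's string.Template purely to show the idea of "specific values override generic defaults"; it is not Levant's template syntax, and the job skeleton, variable names, and values are invented for the example.

```python
from string import Template

# Trimmed, illustrative job skeleton; a real Nomad job needs tasks as well.
JOB_TEMPLATE = Template(
    'job "$name" {\n'
    '  datacenters = ["$datacenter"]\n'
    '  group "$name" {\n'
    '    count = $count\n'
    '  }\n'
    '}\n'
)


def render_job(generic, specific):
    """Render the job with per-job values overriding the shared defaults."""
    variables = {**generic, **specific}
    return JOB_TEMPLATE.substitute(variables)


if __name__ == "__main__":
    generic = {"datacenter": "dc1", "count": 2}   # shared by every job
    specific = {"name": "service-bar", "count": 3}  # overrides for one job
    print(render_job(generic, specific), end="")
```

Keeping the shared defaults in one place means that adding a new, similar microservice only requires the handful of values that differ.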
And in the process actually we achieved a whole lot more. So what did we gain besides high availability? We gained location transparency, mainly due to using Consul. So location transparency, it doesn't matter anymore where one of our microservices is located. It could be in the same tenant, could be in a different tenant, it could be in a whole different data center, it doesn't matter. Consul knows where our services are. It's our main address book and we just ask Consul and we get back the physical location of the server.
Another thing is actually zero-downtime deployments.
Arno: We now have zero-downtime deployments, so we can do rolling upgrades. A new version gets released every two weeks, because our sprint takes two weeks and at the end of the sprint we just release everything. We can do that in a rolling-update manner, using canary deployments or blue-green deployments, and it doesn't take down the whole system. So that's one thing we really wanted to have and now we have it.
Richard: And finally, we are much more resilient against machine failures. Of course everything is highly available already, but if one of our machines dies, Nomad just reschedules the job on a different machine. Currently we over-provision our tenants, so we have some spare machines running, and each time something dies Nomad is able to schedule the job on a different machine. So that's great.
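The reason a rolling update avoids downtime can be shown with a toy simulation: instances are replaced in small batches, so the rest keep serving traffic in the meantime. This is only a sketch of the principle, not of Nomad's scheduler; rolling_update is a hypothetical function, and its max_parallel argument is loosely modelled on the setting of the same name in Nomad's update stanza.

```python
def rolling_update(instances, new_version, max_parallel=1):
    """Yield a snapshot of {instance: version} after each batch is replaced."""
    state = dict(instances)
    names = list(state)
    for start in range(0, len(names), max_parallel):
        batch = names[start:start + max_parallel]
        # While this batch restarts, the remaining instances keep serving.
        for name in batch:
            state[name] = new_version
        yield dict(state)


if __name__ == "__main__":
    running = {"bar-1": "1.0", "bar-2": "1.0", "bar-3": "1.0"}
    for snapshot in rolling_update(running, "1.1"):
        print(snapshot)
```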
So one important thing to note, is that we didn't set out to use all these tools from the get-go. We had the problem, we looked for solutions, adopted a tool, and further along the way every time we encountered a problem, we adopted a tool. And I think that's one of the nice things about HashiCorp tools, that you can use them in isolation, but you can also use them combined together.
So actually in our case it was a gradual migration. We started off using Consul, then Terraform, Nomad, etc.
Richard: So some future work that we're looking into. We talked a lot about Kerberos because it's one of the main constraints that we had. Kerberos is proven technology, so that's a plus. The downside is that, even though it's a very secure protocol, it mostly assumes a static infrastructure. We're running in a cloud environment and, as we all know, the cloud is dynamic in nature, and that just doesn't work very well with Kerberos.
So we worked around it a couple of times and now we're really sick of it, so we're reducing our use of Kerberos. Kerberos will still be around in a couple of places, but it will not be as dominant as it used to be. So we're migrating off that. We are actually migrating to more OpenID Connect kind of stuff.
Arno: And also, besides using the Java and exec drivers, we want to make use of the Docker driver to schedule our already containerized jobs.
Richard: Yeah, we are actually already using Docker currently on Nomad. But we want to use it a lot more. Another thing that we are looking into is managing our datastores with Nomad, so we have a couple of datastores, Cassandra, Elasticsearch, MongoDB, Kafka, that we are running. And we are looking into how we can leverage Nomad to manage those processes.
Arno: And also we can't wait to get started with Consul Connect.
Richard: Yeah. That's really great.
Arno: I don't know where Nick is but I think he's trying to be my HashiCorp on the machine. That's really one thing we can't wait to just use. So, in conclusion...
Richard: So, thank you. Thank you. If you have any questions, meet us afterwards.