Bol.com is the largest online retailer in the Netherlands and Belgium, offering, as of May, a catalog of 9.3 million products to its 5.4 million customers. Powering the online shopping experience is a collection of more than 130 different applications and services that all need to work together. The bol.com team is constantly looking for ways to improve, a mission that has taken us from a collection of several software monoliths to a service-oriented architecture. Recently, we decided to start using HashiCorp's Consul for dynamic service discovery.
In our previous setup, a configuration update took 15 to 20 minutes to propagate across the system. With Consul, changes now propagate in near real time, which enables us to develop more resilient software systems, faster.
This post was originally published by bol.com on their official blog. We're republishing it on the HashiCorp blog so future readers can easily find posts about the usage of our tools in real-world environments.
The Problem - Static service discovery just can't cut it at scale
Last year we built and went online with an infrastructure management system whereby all services had logical names in each environment and developers configured their services to talk to each other. The entire system was, however, completely static. Everything was derived from a single source of truth (a CMDB) which engineers had to edit by hand. Editing was very easy and Puppet-based automation took care of the rest, but the fact that it was not dynamic turned out to be a bottleneck as our software landscape grew.
Want to add another instance of service X to the load balancer? Fire up the CMDB app. Want to change the hostname for a particular service? Well, that means re-provisioning a bunch of VMs from scratch and changing all the properties that reference the old hostname across the testing, acceptance, staging and production environments. Static service registration served its purpose, but bol.com needed a dynamic solution.
What we really wanted was for services to be able to find each other on the fly, irrespective of which hosts they were running on. In short, we needed service discovery.
Provided that your services support it, not having to know exactly which host a service is running on means that you can scale out and move instances in case of failure. Achieving this portability with VMs is feasible (as Netflix proved on AWS using tools like Aminator) but we chose to go with a container-based solution.
The recent rise of Docker has made us question whether we really need a full VM to isolate workloads. Having containers running single processes as our unit of deployment, instead of a VM with an entire OS stack (and all the necessary tooling), was very appealing. It makes our deployment process more lightweight, allows developers to easily run production versions of apps on their laptops, decreases deployment time and can significantly boost the utilization of production servers. In addition, the layered Docker image format and registry provide a very efficient and easy distribution mechanism for container images.
Consul is necessary to orchestrate Docker containers at scale
Containers alone, however, don't provide all of the aforementioned benefits. We needed some kind of orchestration software to decide which containers would be run where and when. As the basis for our new platform we initially looked at using Kubernetes or Fleet but eventually chose to go with Marathon running on top of Mesos. We liked Mesos because it is a battle-hardened, proven cluster manager that really delivers on its promise of essentially turning your data center into one big computer. We chose Marathon as a task runner on top of Mesos because it has a solid API that our internal tools can talk to.
Now that our services could be run on our cluster, we needed service discovery. We first considered rolling our own on top of etcd or ZooKeeper before discovering that Consul is easy to get running, provides a simple REST API that speaks human-readable JSON, supports ACLs, and, as a pleasant extra, has a nice GUI.
Furthermore, the DNS-based discovery feature seemed like a nice way to let legacy applications integrate with the new system. Consul's distributed key-value store and health checking were also features that we envisioned using. Finally, Consul seemed to be gaining wide adoption and has the support of a rapidly growing community. We were confident that this wasn't a tool that would turn into abandonware anytime soon.
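To make the DNS-based discovery idea concrete, here is a minimal Python sketch of how a Consul service lookup name is put together. Consul's DNS interface answers queries of the form [tag.]<service>.service[.<datacenter>].<domain>, where the domain defaults to "consul". The helper name consul_dns_name and the service name "catalog" are hypothetical and used only for illustration; they are not part of our setup.

```python
def consul_dns_name(service, tag=None, datacenter=None, domain="consul"):
    """Build the DNS name under which Consul exposes a service.

    Pattern: [tag.]<service>.service[.<datacenter>].<domain>
    The helper itself is hypothetical; only the name pattern comes from Consul.
    """
    parts = []
    if tag:
        parts.append(tag)
    parts.extend([service, "service"])
    if datacenter:
        parts.append(datacenter)
    parts.append(domain)
    return ".".join(parts)


if __name__ == "__main__":
    # A legacy application would simply be configured with this hostname
    # instead of a fixed host, provided its resolver forwards the "consul"
    # domain to a Consul agent.
    print(consul_dns_name("catalog"))
    print(consul_dns_name("catalog", tag="v2", datacenter="dc1"))
```

The appeal for legacy applications is that nothing in the application has to change except the hostname in its configuration; the resolver does the rest.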
How is Consul actually being used at bol.com?
We are currently using Consul for service discovery within Mayfly, our user-story-based development platform. Because the platform is under heavy development and has an explicit mandate to use emerging technologies, Mayfly was the ideal place within our organization to try out something like Consul. The six services that constitute Mayfly were the first to be "Consul-enabled". Currently we're only using two of Consul's features: service discovery and simple health checks.
We've built an extension to Backspin, our in-house web services framework, that lets services register themselves with Consul and then find each other in a completely dynamic way.
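The Backspin extension itself is not shown here, but the general pattern of self-registration and lookup can be sketched against Consul's HTTP API. The following Python sketch is an illustration under stated assumptions, not our actual implementation: it assumes a Consul agent listening on the default local address 127.0.0.1:8500, and the function names, the service name "catalog" and the health check URL are all hypothetical. It uses the agent endpoint /v1/agent/service/register to register a service with a simple HTTP health check, and /v1/health/service/<name>?passing to find healthy instances.

```python
import json
import urllib.request

# Assumption: a Consul agent runs locally on the default HTTP port.
CONSUL_ADDR = "http://127.0.0.1:8500"


def registration_payload(name, service_id, address, port, health_url, interval="10s"):
    """Build the JSON body for PUT /v1/agent/service/register."""
    return {
        "Name": name,
        "ID": service_id,
        "Address": address,
        "Port": port,
        # Simple HTTP health check: the agent polls health_url every interval.
        "Check": {"HTTP": health_url, "Interval": interval},
    }


def register(payload, consul_addr=CONSUL_ADDR):
    """Register a service instance with the local Consul agent."""
    request = urllib.request.Request(
        f"{consul_addr}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status == 200


def healthy_endpoints(health_response):
    """Extract (address, port) pairs from a /v1/health/service response."""
    endpoints = []
    for entry in health_response:
        service = entry["Service"]
        # An empty service address means "use the address of the node".
        address = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((address, service["Port"]))
    return endpoints


def discover(name, consul_addr=CONSUL_ADDR):
    """Return endpoints of all instances of a service that pass their checks."""
    url = f"{consul_addr}/v1/health/service/{name}?passing"
    with urllib.request.urlopen(url) as response:
        return healthy_endpoints(json.load(response))
```

With an agent running, a service would call register(registration_payload("catalog", "catalog-1", "10.0.0.5", 8080, "http://10.0.0.5:8080/health")) at startup, and its clients would call discover("catalog") to obtain the current list of healthy instances instead of reading hostnames from static configuration.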