The NCBI's legacy migration to hybrid cloud with Consul and Nomad
The National Center for Biotechnology Information (NCBI) provides a case study for their 4.2M-daily user platform.
How do you migrate a large, diverse legacy portfolio to a modern, continuously evolving DevOps pipeline?
How do you ensure smooth transition from 100% on-prem to a hybrid of on-prem and multi-cloud?
How do you test complex, interdependent services, while ensuring reliability and availability?
Kamen Todorov, the DevOps program head at the National Center for Biotechnology Information, offers answers to these questions in this case-study presentation. He shares the NCBI's recent migration experiences using technologies including cluster scheduling (Nomad) and service mesh (Consul).
NCBI serves about 4.2 million users a day, at peak rates of around 7,000 web requests per second. As part of the National Institutes of Health (NIH), NCBI has been developing databases and information resources for the biomedical community since 1988. It has gone through many technology cycles while maintaining long-term archival resources, such as PubMed and the US DNA sequence database—GenBank.
DevOps Program Head at NCBI, National Center for Biotechnology Information
This is going to be a case study, a case study that you guys can learn from. And we can learn from you if you get in touch about any of these issues—don’t hesitate.
The context is this organization called the National Center for Biotechnology Information, which is part of the Library of Medicine, part of National Institutes of Health. NCBI has been around for 30 years, and it’s the home of the United States DNA sequence database, GenBank. But also services like the BLAST services, which allow you to compare sequences against each other and against a bunch of databases; PubMed, which is something that anyone in a biomedical community is using on a daily basis; PubChem; and a bunch of… it’s a long tail of other services that we have. And everything is growing exponentially at NCBI to the point where it has gotten boring—at every presentation you look at exponential slides and it’s not even interesting anymore.
But here’s just an example of how GenBank, which predates even NCBI, grew in the ’80s, exponential growth into tens of millions of bases. And just to explain what they mean by bases, it’s organic chemistry, ACTG, forming base pairs, forming your DNA. So 10 million bases in the ’80s, gigabases in the ’90s, exponential growth even steeper.
Then it becomes linear, but that’s only because we’re switching technologies, and people started doing whole-genome sequencing, and again, exponential growth. Then they started doing sequencing of short reads, really short sequences, and then we grew up into the tens of petabases. If you look at these particular three databases, you don’t even see the other two, that’s how big the Short Read Archive is. You can only see on a logarithmic scale; you guys understand the logarithmic scale.
Here’s another logarithmic scale of how sequencing cost is dropping over the years, starting from $1,000 per megabase, all the way down to less than a penny. And here’s Moore’s Law, so that gives you an idea—Moore’s Law is exponential decay of cost—gives you an idea of how biotech is evolving even faster than tech.
Then we have a lot of data on disk and tape, and we’re looking into the cloud. Sort of creative solutions of how we could leverage the cloud, get people to submit to the cloud, and so on. But that’s a separate topic. I’ll give you a few more metrics such as daily users and page views and daily downloads and so on. And I’ll really focus on the topic of services, and how does NCBI remain resilient in the face of all these technological changes and evolution.
As an engineer, the way I think about what I do, or what makes me tick on a daily basis—building and delivering these products and services of high quality—what makes me happy is to see them used. As engineers, we might be excited about the internals, but we don’t really show the internals, we show the interface of the products and we make a promise of high quality, and we’re setting expectations when we do that.
The world has evolved, and the world expects this high quality, and expects things to just work. [But] inevitably they do fail, these services, and we want to be fixing them. And the thing is, first of all, how much flakiness is there going to be? And second, how are we going to be fixing them? We certainly don’t want to dive into machines, and the cowboy way is to SSH in and do a bunch of fixing. We want to adhere to some principles.
For that flakiness there is service-level agreement, which is basically a contract between the customer and the service provider. Then for the rest of the practices, there are their principles such as immutability and automations, so hopefully not diving into machines.
Then when we go about building these products there are things where we’re adding value from customers’ perspective, and then the rest is basically the cost of doing business, the cost of meeting that high expectation that customers have.
At NCBI we have a lot of projects and we let developers take care of their stuff. The flip side of that freedom is that you end up with a lot of slightly different solutions. And in operations they look like different animals, and there’s a huge number of them. And there’s all this flakiness around causing all sorts of fire situations here and there on the edges.
And we’re constantly busy with that, so at some point we evaluated all that and, although we have this pretty good uptime for the core services, it is pretty expensive for us to be doing things in this particular way, and we didn’t have SLAs in between the services. We made a business decision of taking all that and basically collapse it, collapse the cost, amortize it, make some room in the budget for developing new things, and at the bottom there’s enough room in the budget to grow a proper DevOps solution, and grow a team around it, and so on.
That’s what we started doing like a year back, and we already had a few things going on in terms of a DevOps pipeline. Basically what we said is, “The green layer is where you want to be, and there are some previous things, and everything else, that we want to shift up and move up.” And providing all sorts of incentives, and this incentive structures and schemes in order for people to do that over the years.
It’s not really realistic to force-shift everything, but by providing incentives such as, you know, for new applications you want to be very quick in creating them, just say what you want, attach a budget code so we know whom to charge at the end in terms of accounting. Then you get an application and it’s already operated, and it’s monitored and all that. It’s the perfect application: Hello, world with testing and everything and you just start hacking on it, and now you start breaking it and you got to update the testing and so on.
This is what developers really like, on the one hand. And what we really like, on the other hand, is that we get to control how this is done, and we could evolve this infrastructure underneath.
Then we had a conceptual discussion around services. Your service in the middle, you have some users, you have some dependencies, and then some dependencies have their dependencies of their own. And you have maybe users that are automatic users, and maybe they have their own users. Your service is in the middle of this whole thing.
And another view of this is there’s the complex dependency graph and then there’s also your immediate neighbors. And you’re participating in different routes, so it can get really complex.
Then what is a deployment? A deployment traditionally at NCBI has been: You deploy on top of the existing thing. Some people do it in various ways but a good percent of the population was doing that. When things break you do a rollback by deploying the previous version if you have it around, so now we have to keep a repository of things.
Things can get complex if you have a bug fix. You deploy, and now what do you do? You break something else—do you deploy the previous version? Of course, you’ve got to do a lot of testing, so QA is an initiative that we started like 10 years ago with a proper QA staging environment where we test all sorts of things.
But it has become a bottleneck: You can only test a final set of combinations of versions of services. You cannot really do multiple sets at the same time where they don’t overlap.
A better practice is blue-green deployment—or red-black: Whatever people call it, you guys are familiar with it. You’re doing a test on a site, so the two versions can coexist in that environment. And you could do any number of such combinations—unlimited. Blue-green is just a matter of having the ability to have this routing table available to you, where you switch from one version to another. Then you could discard the previous version of your application once you have the confidence that you no longer need it.
Then another version of this, of course, is the canary deployment, which gives you similar functionality, but now you could gradually shift the traffic from one version to another by using, for example, weights in your routing table. Then you’re monitoring all of that, you could do it by humans monitoring, but you could also do the whole thing automatically by having some calculating of a canary score on a metric such as CPU, memory, latency, error rates, and all sorts of things like that. Then calculate deviation from the previous version, and automatically roll back if you go beyond some threshold.
We started building this whole pipeline and talking about it…
The first thing is your code repo, which essentially means you can control what the code repositories look like, but in our case we have a lot of existing things. It’s just a given—it is what it is.
Then people are running some processes. You’re familiar with how Heroku did it back in the day, introducing Procfile, which is essentially the minimal possible way of describing a process. And that’s what we adopted, and we were really successful in the sense of—it was easy to adopt, penetration was very easy.
Then the other principle, also by Heroku, is this 12-factor principle, or one of them, of splitting your configuration from your code and allowing to basically deploy into unlimited number of environments, including you could deploy your branches, and so on.
We asked people to put their configurations into .env files, which now Docker supports. These two give you enough to be deploying simplistic applications, microservices, web apps, or whatever. Then you need something more elaborate, so we introduced what we call “deployfile,” which is essentially equivalent to a normal job spec.
» Continuous Integration
Then all of that triggers CI. We’re using TeamCity, multiple official languages and frameworks, and of course you build and test and produce artifacts and all that stuff.
Enforcing standards on the way, and then feedback to the user, but also statistics for ourselves.
» Continuous Delivery
In continuous delivery, essentially, you have an existing situation, something in the service registry and then something in the routing table, and you’re deploying a new version of this. And you start testing with whatever the developer gave you, so it’s a small test of some sort, maybe something more elaborate.
But here’s something that I want you guys to pay attention to: This per-request override conceptual feature, which allows you to say for this number of requests for purposes of testing I want
foo to mean something else:
foo is the other version of
foo that I’m testing at the moment.
And not only point directly to foo, but point through something else, maybe foo is a back-end thing that you wanted to test from the front or something like that. And if this breaks, then you shortcut here, you say you failed, but then if you succeed…
What we call “activation” is a simple atomic operation that’s just a routing table update, and we do execute that smoke test again—a forced activation smoke test—and then we say we succeeded. And, if it breaks, then it’s a rollback, which is also touch the routing table again.
Feedback is just in Slack and some links to logs, and gives you links to how you could communicate with people that could help you. And then on success you get to see the thing—and we’ve got all sorts of stats out of that as well. Then, what we call “continuous deployment,” that’s essentially a term in the industry, but it’s delivery plus activation, in our case.
And again, that per-request override feature is important to us because, over the years, we had many applications that need to be tested together. It might be pretty big applications that need to be synchronized for the purpose of release, so they get to independently deliver them in an environment and then independently you could test them within some context, but then more elaborate tests can happen, and they could be activated at the same time.
Then you could do any number of such things simultaneously so you’re no longer bottlenecked in your staging environment. If you produce very quickly, this allows you to not be bottlenecked.
For the rest of the talk, I’m going to talk about operations: cluster scheduling, service registry, service discovery, and monitoring.
On premises we have essentially, whatever we have. It’s organic growth of people asking for machines and they deploy their stuff. It’s something on the order of 4,000 machines and at least that many services. And in the cloud we started doing the traditional thing of baking AMIs and one application per machine auto-scaling groups and all that.
But then we realized there was a better way to do it with cluster schedulers and not only better resource utilization but also this uniformity that we’re looking for between the on-premises and in the cloud. Then that allows us to also do faster deployments and faster auto-scaling and auto-healing. So, we have four good reasons to go with cluster schedulers.
Containers are part of that when you talk about cluster schedulers. The idea of resource isolation is, of course, important, and portability and repeatability are what containers give you, highly desired features. However, this whole thing is evolving from what we see and there’s the Open Container Initiative that is covering milestones of standardization. In the meantime, Docker is the big guy that people generally use, as much as you might like rkt or something else, Docker was looked at from a security perspective in our organization, and the runtime is a concern, and we have to be FISMA-complaint, FISMA law, not a big deal, but still while that discussion is going on we looked at other things so we could be in business in the meantime.
So, Mesos Containerizer will do that—sort of. They have their own runtime, and also Nomad with the exec driver. We chose Nomad, mainly because Nomad is relatively easy to operate compared to the others. You have a single binary, single process, and it’s easy to deal with. So, when we talk about this, specifically with the context of the exec driver, three things: the control groups, namespaces, and file systems.
For exec driver, they cover some of that. I’ll talk about CPU in a moment. They don’t cover network or devices. They don’t cover namespaces, and the file system you get a chroot’ed environment but no quotas or volumes.
For CPU, what we were not 100% happy with is there is guaranteed shares, but we wanted limitations for the applications so we could reason about their performance and so on. What we did is, we forked Nomad, and we enforced that. Ideally we would have both, so we have applications being limited, but then you let some systems jobs go beyond the threshold, which is what Kubernetes does. It may be normal that it would evolve in the meantime as well.
Volumes are another topic. Essentially, if you have external file systems, such as s3—in our case, Panasas and NetApp—you don’t copy those things, they’re large volumes, but also you lose the connection with the source if you want to update something. You don’t link them because there are no symlinks, and hardlinks would not work.
You basically can do this via mounts, which is what we did. Then, another interesting thing is, when you have
bar, supposedly isolated through chroot, the problem is that they could see each other’s secrets. If those secrets are, say, pointing to a memory location or to an external volume, they would be in the mount table. Basically you need more than just chroot, and so namespaces help you isolate. So that’s what we needed now, a fork as well.
Then, the other thing, when we say “volume support,” … what we did is label some machines and then added constraints on the job spec site. But volumes are not first-class objects in Nomad, so we would like to see that. And, so, our whole fix is not that big. We are in touch with Hashi guys to evolve the product.
Couple of other things that I can share is on the security side, TLS got fixed as of 0.6, and I think ACLs got fixed as of 0.7, just a couple of days ago. We’ll see how that works, I was gonna report that they need to be. We need something between the user and Nomad. A couple of other things on the Terraform Nomad provider, it would be nice for the Nomad provider to be checking the job spec rather than job description.
And a little bit of structure. What you see on the right is the desired structure rather than just the string, what it is now. But that’s minor stuff, just nit-picking a little bit to make sure the product evolves. Other than that, we are going forward with Nomad. We have four good reasons to be happy with it.
Exec driver does work for us for the time being. It covers use cases, it is known to scale, so it’ll scale for us, and the couple of limitations, that’s not a deal-breaker at all at this point.
Spark is something that we’re interested in, but we haven’t done research on that yet. We know that Nomad requires a special version of Spark and so, if people are working in this front, can you get in touch with us and let us know how it’s going?
Another big questions is, What do you do with databases in general with cluster schedulers? Are we there yet? Do we need a couple of years of evolution and maturity? This is also a big question, but I’m not gonna talk about that.
I’ll talk about service registry and what it means in our case. We have on the premises a very mature solution for service registry already developed over the years, in-house developed, featureful and all that. But it makes some underlying assumptions that are not true in the cloud. So we started looking to another solution, rather than re-engineering the whole thing. We were looking for products of various types that could fit our requirements, and some people built the solution on top of the ZooKeeper initiative.
Then, we looked at SkyDNS. It belongs to a set of products that basically give you enough primitives, but it’s ultimately a do-it-yourself solution. We discarded that idea. Something like Eureka is a full solution, but it’s only AWS. Then cluster schedulers give you a full solution, but you gotta use them. So, for the time being we could not consider that.
The only product that is on the market, from what we saw, is Consul, and it perfectly fit our requirements. Not only gives you service registry, but distributed health checks, the awareness of the data centers, security, and good scale. And so we went into production with it.
And in our case, what we experienced was a bit of flakiness, in particular because our connection to AWS is a direct connect that was not properly SLA’d, and so we had a period of time when we had issues. And the other thing, we didn’t have a full mesh initially. As soon as we connected all the Consul servers, all the flakiness disappeared.
Then, there’s another way to do it, to just use the enterprise version, where they compensate for not having a full mesh by doing hub and spoke functionality.
Then, in terms of scaling, what we did is touch the Raft multiplier, but also use stale consistency mode as much as possible, and it can be used in most of the cases.
ACLs are another way to limit the amount of noise, basically data transfer and events, only clusters can be scoped by ACLs. The enterprise version has the non-voting servers, so you can scale the servers linearly and that’s another way to scale it. We’re not using those yet, but it’s already pretty stable.
Consul is the brain of the whole operation, it’s where the status of things is and has been a remarkably stable component among other components in our infrastructure that we developed in DevOps. So we’re happy with that. We only saw a couple of fire situations, which were entirely due to our mistakes as operators. We relatively quickly solved them—within 20 minutes or so.
Service discovery is the other topic, which we consider something different from service registry. The way we think about it is, it’s a layer on top, and in terms of our requirements again, that pre-request override that I talked about earlier, the traffic shifting, and then the whole legacy world and the new world had to be done.
When you think about the whole effort, there’s like two ways to do it. You could basically migrate everything, which is unrealistic, or you could essentially create a solution of some sort to have a hybrid and then gradually move things over. And the solution for us has been a product called Linkerd, by Buoyant. That basically covers all those use cases.
In terms of what the product does, maybe people have heard of the term service mesh, it has been talked about for the last year or so. Can we raise hands? Service mesh awareness? Okay.
There’s Linkerd on one hand, there’s Envoy on the other hand. How many people have heard of Envoy? How many people Linkerd? And maybe people are using that. There’s also Istio. There are also some other proxies such as Traffic and Fabio, that may evolve into this service mesh thing.
In general, the way to describe the service mesh is essentially foo needs to talk to bar, you have some application code, and you have some network to worry about. Then historically we have things like flow control, which moved to the TCP/IP stack. So, now we have SOA and microservices, you have things to worry about such as circuit breaking and retries, and so it makes sense to push this down at least to a level of a library.
But some people argue all the way outside and make it part of the stack. And there are reasons for that. If it’s a library, you gotta maintain it, you gotta push updates onto applications. You gotta do this for every language they have.
Current solution is to generally have this as this like a sidecar to the application or an agent on the machine. And this is how they depict it. Then you get to have a single pane of visibility onto the whole service mesh. You could do distributed tracing this way. If you don’t have a solution for that, it’s an excellent solution. If you do have, like we do, it’ll complement your solution and enrich it.
Then, those two use cases for per-request override and switching the routing can be done by the service mesh. The only product that does this right now—the pre-request override—is Linkerd. So that’s what we went with in production, and here’s our setup:
Linkerd is a proxy, but it also has a name resolver, which could be part of the proxy or could be externalized. And the name resolver is what talks to your service registry—or service registries: it can have multiple back ends. Then, when foo needs to talk to bar, it goes through the proxy, the proxy will ask the namer, the namer will talk to your service registry, and now the proxy can proxy. But then, any subsequent requests for bar, Linkerd isn’t going to ask because it already knows.
The updates from the service registry, such as back ends failing, service health checks, are going to be propagated as knowledge all the way to Linkerd. But Linkerd has its own knowledge, because it does the proxying, so it knows which back ends are slower than the others and so on. And it could do all that circuit breaking for you.
The beauty of this is that you could hook multiple service registries, and so we hooked our legacy system, service-discovery system, and now we can talk to legacy applications. Basically anything can talk to anything. A new application can talk to both new and old, and if it’s an old application that is using one of the supported protocols, then everything is basically covered out of the box. Then there’s another mode, because Linkerd is an application-level proxy—there is Linkerd TCP as well, which is L3 level. It is also evolving and written in Rust.
Maybe this whole thing will evolve further, but you could do a lookup, followed by a service call, and that happens to be the model that our legacy world had. So a legacy application would basically do a lookup in the legacy system. What we did is, we took that library for a couple of languages only, and we provided another back-end for namerd, and now we just rebuild the application and you’re in business. Essentially, this covers 100% of our applications.
So there’s zero-cost migration from an old world to a new world. And now, of course, we depend on Linkerd, but that’s for us to worry about as operators of that. The way we did HTTP is essentially, foo needs to talk to bar, there’s a URL bar.Linkerd.ncbi or whatever the URL would be. Some sort of scheme that Linkerd now understands and knows that this is not google.com, so no need to resolve it by DNS but by service discovery. And the way we do that is we wrap the applications when they go through CI/CD, they are wrapped with environmental variables that basically HTTP proxy and non-proxy, those are de-facto standard environmental variables that all the major HTTP libraries are honoring. And this means that the applications don’t even need to know that they are participating in this at all—this service mesh thing.
And the whole thing is recursive, all the way down. There’s another model for it, Linkerd to Linkerd, which gives you certain benefits, that you could upgrade protocols for example or encryption, maybe with Vault. That’s something that you could do by just registering essentially the Linkers rather than the applications themselves in the service registry.
This gives you this recursive architecture, starting from the user web request. They discover the edge by DNS obviously, but then it’s all the way down to—maybe it’s Mongo, maybe it’s Solr, maybe it’s whatever other database. In the case of Solr, it’s HTTP, so you could still apply this. In case of, say, SQL Server, what we do at the moment is to basically look up and then follow by a service call. This call, per-request override, also works within this complex chain of calls and lookups, just because namerd and Linkerd understand the overrides.
The overrides, it’s essentially a delegation table, local—DTAB is what it’s called—that you pass around as a context, and that’s a feature, Finagle, that has been battle-tested in Twitter, and that’s how they do it, but it could be done in other ways. The way that we introduce this is we abstract away all of that. The technology underneath, you can have multiple solutions, but we hopefully will be still in business.
We’ve operated Linkerd for about a year, reached a scale of about 5,000 requests per second at the peaks and about 1,000 services both new and legacy. It gives us a lot of metrics. And that brings me to monitoring, which is the final topic of this case study.
We monitor every piece of infrastructure, including monitoring itself. AppLog is an old system that we had developed over the years, which started with application login, but then grew into a number of other things, such as even analytics and stats and whatnot. It is quite comprehensive. At the same time, you look at Google Analytics and it’s a little bit better in some certain things, so we started complementing this system with other things, with other products. In case of analytics, Google. In the case of time-series data, something like InfluxDB is much better. It’s just the proper storage for such data.
And what we like about the TICK Stack in particular is that it’s nicely decoupled, and it has all the nice interfaces between the pieces, so you can substitute pieces. We are using Grafana, for example, rather than Chronograf, which is evolving at the moment. Then we’re happy with tools like Telegraf, which really has a lot of plugins, more than 100, and you can do things like gather Consul stats, but if you wanna get metrics about the services, for example, you could do Consul Exporter, which basically clears the API of Consul and exports things in Prometheus format. Same for Nomad and similarly for Linkerd, which gives you a number of options. What we’re using is the InfluxDB line protocol API to use Telegraf. And, say, for namerd, for the apps, we told people you could do a StatsD or Xpulse Prometheus endpoint and then for any custom metrics that you want.
In terms of dashboards, we generate all these dashboards and we think of dashboards as a service. If you have an application that goes through CI/CD, we’ll get that dashboard. But not everything can go through CI/CD immediately, so if you want that dashboard, declare your service and its dependencies and its SLA, and we’ll give you a dashboard. It’s an incentive to extract the dependency graph on one hand declaratively. On the other hand we’re looking at distributed tracing and we’re pushing people to log in, so we get a full picture by combining the two methods of declaration and observability.
Another piece of the TICK Stack that I want to talk about is Kapacitor, which gives you serial processing but also batch processing. It can go and fetch batches from InfluxDB, for example. It’ll alert you when there’s no signal but also has a lot of built-in actions, and you can do your own. It can do a number of things for you, including auto-scaling services, especially given that we have the service mesh. We have a lot of things like requests per second, the speed of change of requests per second, latency error rates, and things like that. It’s a pretty good set of metrics, much better than just CPU and memory, which essentially we think of as a derivative metric—you get CPU because of something else. So that’s important to be able to feed those metrics into whatever is going to act on the system for the purpose of auto-scaling.
You could do the same for cluster capacity, and there are tools like Replicator, which people were talking about yesterday; that’s another solution. They would probably have to evolve it a little bit to be able to feed it with metrics, rather than just CPU and memory, but it’s another alternative, obviously. Then auto-healing can be done the same way. Say an application back end is for some reason not in a good shape, you just kill it based on metrics and then let the scheduler deploy another one. It looks like things are converging into a solution around that.
The AWS application load balancer is doing such things nowadays, and Nomad as of the last version, 0.6, also has something on that. The point I’m making is that the service mesh would give you these additional metrics that you could fit into the system and do the auto-healing in a way that’s a little bit more intelligent.
At the end, there’s all this auto-healing, but you’ve gotta involve the humans when things fail: alerts, create incidents in your incident tracking system. Then we could talk about SLAs finally, and it’s basically the proper way of doing these promises about quality, and you could define your tiers and talk about multiple 9s and so on.
So that’s our story of how we decided to introduce all this quality between the services and introduce SLA and contracts between the microservices within not only the general core services of NCBI.
If any of this is interesting, get in touch with us. Don’t hesitate.