Hear how Tide abandoned its adoption of AWS AppMesh in favor of HashiCorp Consul, making the transition in only 6 weeks with no downtime and no big-bang migration.
After a long, difficult, and ultimately abandoned journey with AWS AppMesh, Tide adopted HashiCorp Consul this year within six weeks, across both ECS-EC2 and ECS-Fargate. This is the story of how they did it with no downtime and no big-bang migration. Along the way, they found ways to…
Make it easy for the engineers to define their own intentions and configure their upstreams in code
Solve issues with misaligned health checks
Explore options for developer-ingress to the mesh
Speaker: Jez Halford, Head of Cloud Engineering at Tide Business Banking
Hi, my name’s Jez. I’m going to talk to you about Tide’s self-service service mesh with HashiCorp Consul. First thing to get out of the way is we are Tide, we’re not Tide. We do a modern business current account. It gives time back to people who work for themselves. We’ve got about 350,000 members in the UK. That’s about 6% of the market. Launching soon in India. Check us out. We’re also hiring. That’s who we are, nothing to do with washing.
Behind the scenes running all that stuff, we’ve got about 250 engineers, 120 services across around 30 teams. We deploy on average 15 times a day. We’ve got some scale, some complexity. We’re not the biggest. We’re not the smallest. There’s some stuff going on; it’s chunky. We run Amazon ECS for all of that: ECS backed by EC2 instances, and we also run on Fargate.
You can see a few interesting things there. We have Gravitee, which is a third-party API gateway that works nicely for us. We have a bunch of microservices. We have Kafka in the mix and some RDS databases. And it’s a setup we like, it’s nice.
Well, this picture is a lie. Call it an abstraction because the picture is more like this. We have this internal Amazon ALB, an application load balancer, and that’s how all our services talk to each other. They all go from the service to the ALB, back to the other service. And that’s okay. It’s something that we did a couple of years ago when we had eight services, and it makes a whole lot more sense in that picture.
In this picture, it starts to get a bit untidy. We also don’t have any control over which services can talk to which other ones. If they can access the load balancer, they can talk to whatever. We wanted to fix that. We wanted to make that better from a security perspective.
We wanted to make that better from a reliability perspective as well because although that’s an Amazon-managed ALB — they’re sturdy — it’s still a single point of failure. We needed a service mesh, and we didn’t have one because when we started with ECS, there wasn’t one available.
We spent a long time looking at Amazon AppMesh because that’s the obvious choice if you’re using ECS.
We found that it had a few problems. I think it’s a fine choice if you’re starting from scratch with ECS today. But because our ECS environment is a little bit older, it has some assumptions built-in — and the way things work, it’s hard to retrofit AppMesh into the picture. And there are a few call-outs as to why.
Firstly, ingress. That Gravitee API Gateway product I talked about is another service in our cluster and AppMesh doesn’t have good support for that. Obviously, they want you to use something like Amazon API Gateway. Services within AppMesh have a limit on their upstreams, the number of other services they can talk to. And that limit is ten.
It’s an AWS service limit. You can apply to have it increased, but ten is very low. We have 120 services. Our API gateway needs to talk to most of them. Ten doesn’t work. And I think with Amazon service limits, they’re hinting at intended infrastructure. They’re saying this is ten, so if you need 200, you’re doing something wrong.
That’s a conversation we had with Amazon support, and they were like, “Oh yeah, it doesn’t work.” We couldn’t find a solution to that. With Consul, it’s not a problem — there’s no limit in that way. But there’s also an ingress model that’s better — that I’ll talk about in a bit.
mTLS as well. We wanted mTLS communication between all our services. You can do that with AppMesh out of the box; that’s fine, but you need an AWS-managed private CA.
And I’m stubborn about AWS’s CAs because they’re so weirdly expensive; they’re at $400 a month. It’s not a lot of money, but it’s also quite a lot of money. I don’t understand that pricing. It bothers me because we’d need one per environment, and running in multiple countries doubles that up again. It seems odd. I don’t understand that product. It bothers me.
With Consul, it’s not a problem because the TLS certificates are issued internally. Nobody has to worry about it; it uses Consul magic. Another thing we discovered with AppMesh: ultimately we want to lock down global egress. We don’t want to have services going, “Hey internet. I want to talk to you.” You can do that with AppMesh; that’s a design goal of the mesh. But you have to have all or nothing. You have to have every service the same.
So every service has to be using mTLS, and then you lock that down. That’s very hard when you already have an installed base of services in an existing cluster. That was difficult for us to get our heads around. With Consul, again, it’s a little bit more flexible. You can do it on a per-service level. You can tweak things. It gives you a path from here to there, and you can start to play around with enabling that for some services and not others. Again, the flexibility is what we needed.
We also have a few services that exposed more than one port. Maybe we’re doing it wrong. I don’t know. I think there’s a good argument that says if you’re exposing more than one port, then you secretly have more than one service. But that’s not the story we have. We have both gRPC and HTTP. We have some that have a management port. It’s the situation we’re in. There’s no solution for this in AppMesh.
We were looking at: do we need to split our services? That puts a lot of engineering overhead on our teams. So, there’s all those teams. We don’t want to have those difficult conversations. This was a real stumbling block for us. And again, with Consul, this isn’t a problem because the flexibility is there.
Finally — and this is an interesting one — AppMesh doesn’t have a good story around developer ingress. By developer ingress, I mean I have a laptop, I’m developing something, and I want to go and poke a service that’s internal within the mesh that wouldn’t necessarily be publicly routable. It’s not necessarily connected directly to our API Gateway, but I want to poke it. I want to do stuff — to debug it.
AppMesh has no story there, and at first Consul didn’t have an obvious story either. We spoke to Amazon about it, and they were like, “Oh yeah I don’t know, maybe an ingress proxy per service.” But that’s expensive, and doubles that compute. With Consul, we figured out a way. We haven’t implemented it yet, but I’m going to talk about it because I think it’s interesting. Again, that flexibility gives us options. That was how we went through. We tried AppMesh, and we got through that process, and then we were like, ah, maybe Consul.
As well as that stuff, we had a couple of key drivers. Firstly, the ready-made control plane thing. We use HCP — the HashiCorp Cloud Platform — that gives you a Consul control plane straight out of the box.
It’s similar with AppMesh. AppMesh’s control plane uses the Amazon magic — you don’t have to think about it. We didn’t want to have anything more complicated than that, and HCP seemed like a good fit there. As I hinted, we needed it to be easy to configure for our engineers. We needed them to be able to set up their services, to use this quite easily. That was important for us.
Intentions, which are the things that control which services can talk to which other services, are something that we want fine-grained control over. We wanted to find a separate way of governing those. I talked about API Gateways. I talked about Gravitee. Compatibility with Gravitee was obviously an important thing to address, as was developer ingress. Then finally, getting Consul up and running without any significant downtime; ideally, without any downtime. That was our aim.
So this is the picture we wanted to end up with. We wanted to end up with HCP peered into our environment, acting as our control plane and all our services talking to each other through Consul sidecars that are marshaling that connectivity. It’s a good picture. I like it. How are we going to get there?
This is the picture. I love this picture because it’s full of arrows. You’ve got HCP at the top, VPC peered in, that’s easy. Get your credit card out. They give you a peering ID. Job done. Then you’ve got some options. You can see here I’ve got two examples of ECS tasks. And an ECS task, you can think of it as roughly analogous to a Kubernetes Pod if you prefer that world.
So it’s a private network space which your task runs in. It has a service container that is doing the meat of the work. You have two sidecars, Consul Agent and Envoy. You can see the Consul Agents all talk to each other and figure out what’s going on and like, “Hey, what are you allowed to do? What am I allowed to do?” And then Envoy is what’s doing the traffic — doing mTLS.
There are other options for this. If you’re using ECS based on EC2 instances, which we are, then you can run the Consul Agent as a daemon on each host. Instead of having an agent per service as a sidecar, you have an agent per container host, per machine. We kicked that around. We didn’t like it as much because the health checks (the arrows you can see in red here, the Envoy stuff) are unencrypted. That would mean unencrypted traffic going outside the task boundary.
I don’t know if that’s a problem. It seemed wrong. Maybe having a Consul Agent per host is a little more resource-efficient. I don’t know, there’s an interesting trade-off, but this is the picture we ended up with. I like the picture. I think it works. From a services perspective, all those complicated areas disappear, and you end up with a bunch of upstreams on some local port numbers — arbitrary numbers. In this case, I’ve chosen 9001 and 9002. Your services that your particular service depends on are just there.
I love this picture because you could easily recreate that environment anywhere else. You don’t need Consul to do that. I could recreate that on a development machine. I just expose a port locally. There’s my service. I could recreate that in a completely different orchestration layer. I could recreate that on bare metal. It doesn’t matter. It makes our services that little bit more portable because the environment that they’re in is straightforward. There’s no dependency here on that load balancer or on Route 53 entries for service discovery. It’s all there and that’s nice. So that’s the control plane. That’s how it all hangs together at a high level.
Once we had that picture we needed to configure it — we needed to roll it out. This is roughly how we deploy services right now at Tide. We have a repo, and the repo is the services source code. We have a Dockerfile, and a Manifest, which is a bunch of YAML that describes how that service is supposed to work.
When you hit deploy, there’s a docker build step, and the image gets pushed up to a registry. Then there’s a deployment step that parses that manifest file and spins it out into a CloudFormation template. The CloudFormation template describes firstly the ECS task definition (think of that as a bit like a Helm chart). It also describes any dependent infrastructure; S3 buckets are a classic example.
So that’s how it works. It works very nicely. This is an example of that manifest. You can see it defines health checks, defines ports, CPU, memory. The obvious thing is to add upstreams into that. Then there are a couple of lines there for engineers to add to those manifests.
They can define, “OK my service needs to talk to this thing and this thing, and it’s going to use these ports.” Again, the port numbers are arbitrary. It doesn’t matter. They just all get exposed locally. We think that’s quite nice. That’s a three-line change and you can use your service mesh — but we could maybe do better than that. It gets parsed — deploy time again — and the upstream config finds its way into there. It goes into the task definition as an environment variable that’s visible to the Consul Agent. Happy days, but maybe we could do it nicer.
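As a rough illustration of that deploy-time step, the sketch below takes an already-parsed manifest and flattens its upstreams into a single environment variable. The manifest keys (upstreams, service, local_port) and the variable name MESH_UPSTREAMS are assumptions made for this example; the talk does not show Tide's actual schema or the variable the agent reads.

```python
import json

# Hypothetical parsed manifest; the real one is YAML with more fields.
manifest = {
    "name": "payments",
    "upstreams": [
        {"service": "accounts", "local_port": 9001},
        {"service": "ledger", "local_port": 9002},
    ],
}


def upstreams_env(manifest):
    """Return the environment entry carrying the upstream list.

    Raises ValueError if two upstreams claim the same local port,
    since each upstream needs its own localhost listener.
    """
    seen = {}
    for upstream in manifest.get("upstreams", []):
        port = upstream["local_port"]
        if port in seen:
            raise ValueError(
                f"port {port} used by both {seen[port]} and {upstream['service']}"
            )
        seen[port] = upstream["service"]
    # Sort by port so repeated deploys produce an identical value.
    pairs = [{"service": s, "local_port": p} for p, s in sorted(seen.items())]
    return {"name": "MESH_UPSTREAMS", "value": json.dumps(pairs)}


env_entry = upstreams_env(manifest)
print(env_entry["value"])
```

Rejecting duplicate ports at deploy time is a design choice of the sketch: the port numbers are arbitrary, but two upstreams cannot share one local listener.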
Instead of this, if you make it slightly more complicated, you can do something interesting like this. This is a Python application that sits in our deployment pipeline and parses this stuff. We allow developers to specify an environment variable that the service address will be available in. We use this magic string HTTP URL, and that gives us this picture. Instead of having to worry about injecting your config to a localhost port, it’s just here in an environment variable. And because a lot of our services use Spring Boot, if you use a Spring Boot property that’s named to match your environment variable, then that Spring Boot property is automatically populated.
And now, all you have to do is that one change to your manifest and your Spring Boot properties automatically contain the address of your dependent service. That’s easier, that’s nice. That’s genuinely a very small change, and everybody’s using the mesh.
This property gets automatically populated, as I said. Then we can have fun with this. You can define a few different options. You’ve got HTTP, as I mentioned. You’ve got GRPC, and TCP if you’re protocol-less. List, which is useful for injecting into the HAProxy config. There’s also a template which will drop the port number in some way, magic. You can define a boilerplate parse for APIs. That’s quite nice. Nice and flexible, you can do whatever you like with it. It matches up to the Spring Boot config and it takes some of the overhead off of engineers who were adopting this. I like that. I think that’s fun.
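To make those options concrete, here is a minimal sketch of how a pipeline step might render each upstream into the environment variable a developer asked for. The format names, the {port} placeholder syntax, and the exact shape of each rendered value are assumptions for illustration. The list option is left out because the talk does not describe what it produces.

```python
def render_address(fmt, port, template=None):
    """Render the value injected for one upstream bound to a localhost port."""
    host = "127.0.0.1"
    if fmt == "http":
        return f"http://{host}:{port}"
    if fmt in ("grpc", "tcp"):
        # Assumed: both are plain host:port pairs with no scheme.
        return f"{host}:{port}"
    if fmt == "template":
        if template is None:
            raise ValueError("template format needs a template string")
        # Assumed placeholder syntax: {port} is replaced with the local port.
        return template.replace("{port}", str(port))
    raise ValueError(f"unknown format: {fmt}")


def build_environment(upstreams):
    """Map each upstream's chosen variable name to its rendered address."""
    return {
        u["env"]: render_address(u["format"], u["local_port"], u.get("template"))
        for u in upstreams
    }


# Hypothetical upstream declarations, one per dependent service.
upstreams = [
    {"service": "accounts", "local_port": 9001, "env": "ACCOUNTS_URL", "format": "http"},
    {"service": "ledger", "local_port": 9002, "env": "LEDGER_ADDR", "format": "grpc"},
    {"service": "rates", "local_port": 9003, "env": "RATES_API",
     "format": "template", "template": "http://127.0.0.1:{port}/api/v1"},
]
environment = build_environment(upstreams)
for name, value in environment.items():
    print(f"{name}={value}")
```

With Spring Boot's relaxed binding, a variable such as ACCOUNTS_URL would populate a property named accounts.url, which is the kind of automatic match described above.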
Intentions, as I said, are important because they control what services are allowed to talk to what other services. Here’s a bunch of services doing their thing, talking to each other.
If I do this without an intention, that’s not going to work. That’s going to get blocked. I have to put an intention in, which is a configuration within the control plane to enable that traffic to happen. That’s quite powerful. The primary reason we wanted a service mesh was to control this stuff. We have this flow where we’re deploying stuff. We define these upstreams, and that gets peer-reviewed by pull requests, then it gets deployed.
One option would be to have that piece of configuration drive the intentions. If I say I’d like to talk to my target service, then we create an intention for that piece of connectivity and then it’s zero additional effort. That’s cool, but it’s also scary because security. So we ended up with this picture where intentions are configured separately — subject to a separate pull request which we peer at very carefully and then deployed.
You can see the configuration looks like this allow_from structure here. That is all in a single repo. All our intentions are in one repo. It has a lot of controls over it. It has a lot of security folks peering at it. Each service has a YAML file, and that YAML file defines what services are allowed to talk to the particular service the file is for. That mirrors the upstreams. The upstreams are, “I would like to talk to X.” And the intentions, the allow_from construct, is, “I will allow Y to talk to me.” And you can play around with that. There are a lot of options there. We thought about it: if you can have allow_from, you can have other variants too. There’s a precedence order, and it starts to get very complicated if you use more of the different flavors of allow and deny. We stuck with allow_from. We might do deny_to for a few particularly important services or where we have particular concerns. But I think having this mirror the upstreams is a nice structure for the general case.
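The allow_from idea can be sketched as follows: each destination service lists who may call it, and a small step turns that into a dict shaped like a Consul service-intentions config entry. The service names and the in-memory layout of the parsed YAML files are hypothetical, and this is not Tide's custom CloudFormation provider, only an illustration of the data flow.

```python
# Hypothetical contents of the intentions repo: one entry per destination
# service, as if each service's YAML file had already been parsed.
intention_files = {
    "accounts": {"allow_from": ["payments", "mesh-ingress"]},
    "ledger": {"allow_from": ["payments"]},
    "payments": {"allow_from": []},
}


def to_config_entry(destination, spec):
    """Build a dict shaped like a Consul service-intentions config entry."""
    return {
        "Kind": "service-intentions",
        "Name": destination,
        "Sources": [
            {"Name": source, "Action": "allow"}
            for source in sorted(spec.get("allow_from", []))
        ],
    }


def is_allowed(source, destination, files):
    """Default-deny check: traffic passes only if the destination lists the source."""
    return source in files.get(destination, {}).get("allow_from", [])


entries = [to_config_entry(name, spec) for name, spec in sorted(intention_files.items())]
for entry in entries:
    print(entry["Name"], [s["Name"] for s in entry["Sources"]])
```

Sticking to a single allow_from list per destination keeps the check a plain membership test, which is the simplicity the talk argues for when it avoids mixing allow and deny flavors.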
It gets deployed with CloudFormation with a custom stack, custom provider. You can do this with Terraform. There’s a Terraform provider for it. We use CloudFormation to deploy our services, so it made sense for us, but by all means, use Terraform. We do use Terraform for our underlying infrastructure. We just don’t use it for services. But it was easy to do either way. Behind the scenes you’re poking the Consul API.
Next thing was Gravitee. We like Gravitee a lot. It’s flexible. It has a plugin architecture, so our auth fits into it. You can do interesting things with SSO through it. You can do all kinds of fun with it. It’s flexible, powerful and we’re very happy with it. There’s a graphical interface that engineers use where they configure their routes.
Obviously, we could have defined it as another service in the mesh and have it define upstreams for all the services that it wanted to talk to. But that would be a really long set of upstreams and a code change for every deployment of a new service.
We’d have to make that code change and configure the routes through the graphical interface. And again, it seemed clumsy. We have all these engineers, and they’re very busy — and they’re very expensive — and I didn’t want to waste their time with it.
There’s another option which is roughly this. Our API gateway is outside the mesh. There’s an ALB in front of it where traffic comes in from the internet. That’s just for high availability: load balancing between the different container instances. Then we have another ALB. Again, high availability, load balancing between different instances of a Consul ingress gateway. This is Consul running in another mode. It doesn’t have a primary container. It’s just being an ingress gateway. It’s allowing traffic to whatever.
And then simply any service that needs to be addressable from the API gateway has in its intentions file an allow_from entry for the Gravitee mesh-ingress. You can see that. And that allows the ingress gateway to talk to the service, and Gravitee just talks to the ingress gateway.
It’s nice, and it’s very flexible. It means that to make a service publicly accessible from the API gateway, you configure the route in Gravitee’s graphical management interface, and you configure the intention within your own service’s intentions file. Job done. We were pleased with that picture too.
I talked about this, and I said, this is experimental. We haven’t done this yet, but we have a good plan. We talked about it a lot with HashiCorp while we were setting things up because we wanted to make sure we had a good story. We didn’t want to get to the point where the mesh was all ready to go and then all of a sudden the people can’t work. That’s important to us.
It looks like this. We’ve got a developer that’s using their laptop, and they want to connect to some arbitrary service within the mesh. Not necessarily one that’s addressable from the ingress gateway. It’s just something in the middle that they need to poke for whatever reason.
So if everything’s locked down with intentions and the mTLS, how do you do that? Well, we think it looks something like this. Firstly we have a pool of agents within the mesh that are there for developer purposes.
They’re in the mesh because they use this gossip protocol to stay in touch with their peers. The gossip protocol is a little bit sensitive around latency. So, if it was talking over the connectivity from a developer’s laptop (obviously we’re all working from home nowadays), somebody might be on a train, somebody might be in a cafe. Their connectivity is vague, and we didn’t want to have that cause an impact on the wider cluster. So we keep those within the ECS cluster. Then, on the developer’s laptop, we have a Docker container that’s specific to this purpose.
The Envoy proxy is on there, and there is a configuration file much like that upstream config file that we already saw, much like that manifest. You configure that file. You say today I’m talking to the X service and the Y service, so I’ll map those to this port and this port. You boot up the container, it joins the mesh, and everybody’s happy. We think that works well. It creates that same environment that I talked about where the service (in this case, the service under development) has a local port, and it talks to the service that it needs over that port.
You don’t have to have a separate configuration file for local work versus development environment versus a production environment. It’s all there on the same port, and it works. It’s nice and I think that’s a good picture. We’re going to roll it out soon, and we’ll see how it works. If other people have solved this in other ways, it fascinates me, this problem. Because as I said, there’s not an obvious story. Nobody’s documented this anywhere that I can see. I think this is interesting, but I think that picture is one that we’re happy with.
The final step is rolling it out. How do we do it without any downtime? We’re going to go from this messy picture A to this tidy picture B — and we’re going to do that in a few steps.
We’ve already done this one: We add the agent deployment to our existing deployment pipeline. After a while everything gets redeployed — we now have an agent running as a sidecar next to every service. It’s not in use yet. It’s there doing its nice agent-y thing.
Then we asked the developers to enter that upstreams configuration in each service’s manifest and define those intentions. We’re not yet going to enforce those intentions. We just define them.
Then after a while, there is a degree of saturation, and we can monitor that internal ALB that I talked about. You get good analytics from CloudWatch. You can see what’s going through it, what’s not. You can see per target group. And we see each target group which references each service dropping to zero as services get deployed using the mesh. The traffic through the ALB goes to zero. That’s the stage we’re at now. That was where we got to.
The last pieces are tidying up, mopping up. That will be once the internal load balancer traffic reaches zero. We confirm all the mesh traffic is aligned with intentions. We’re not entirely sure how we’re going to do that, but I think it’s okay. Then we enforce those intentions, and we’re done. There we go, we’ve adopted Consul.
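The talk leaves open how to confirm that traffic lines up with intentions. One possible starting point, offered purely as an illustration and not something the talk describes doing, is to compare the upstreams each service declares against the allow_from lists before enforcement is switched on. All service names and data structures here are hypothetical.

```python
def missing_intentions(upstreams_by_service, allow_from_by_service):
    """Return (source, destination) pairs declared as upstreams but not allowed."""
    gaps = []
    for source, destinations in sorted(upstreams_by_service.items()):
        for destination in sorted(destinations):
            allowed = allow_from_by_service.get(destination, [])
            if source not in allowed:
                gaps.append((source, destination))
    return gaps


# Hypothetical data gathered from service manifests and the intentions repo.
declared_upstreams = {
    "payments": ["accounts", "ledger"],
    "statements": ["accounts"],
}
allow_from = {
    "accounts": ["payments"],
    "ledger": ["payments"],
}

gaps = missing_intentions(declared_upstreams, allow_from)
print(gaps)
```

A check like this only covers declared configuration. It says nothing about traffic that still bypasses the mesh, which is why the ALB metrics mentioned earlier would remain the other half of the picture.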
It did take us about six weeks to get to point three. But I think that’s because we have this large number of services and this large number of teams. There’s a long tail of services that need to get caught up as part of our usual maintenance cycles. If we hurried, we could probably do it quickly, but it fits in with all our other priorities and gets deployed as part of the natural cycle of development.
That leads us to a point where we’re ready to switch on the intentions enforcement, and the mesh is up and running, and everybody’s happy. That was our journey with Consul. I hope it’s been helpful. Thank you very much.