Learn Vodafone UK's approach to enabling self-service observability for all developers while also ensuring the platform is always monitored and SLOs are implemented.
Speaker: Llywelyn Griffith-Swain
Hi, my name is Llywelyn Griffith-Swain. I'm the SRE Manager at Vodafone UK, and I look after the release, resiliency, and performance-engineering teams. Within Vodafone Digital, or Digital Engineering UK, we look after anything you've accessed through the internet — whether that’s the MyVodafone app, chatbots, or any of the websites. I'm here today to talk to you about observability as code and how we achieve this with Terraform.
Before we get started, I wanted to talk a bit about the road to observability as code — or OaC as we call it. Vodafone itself — we're on a journey from being a traditional telco into a TechCo. That's a technology communications company with a big emphasis on software engineering.
In 2018, we started our DevOps journey — this was a lot of insourcing, and we rapidly grew in scale and size as we were able to deliver and meet the demands of the business. In 2019, our site reliability engineering (SRE) journey started. We were fortunate enough that this came organically, as we had reached a point — where to accelerate the rate at which we could deliver — we needed to have SRE in place due to the scale we were working at.
I was fortunate to be one of the first four SREs. So I helped to build lots of the initial implementations we're going to be talking about today — but the team has taken it one step further. It’s also worth noting that in 2020 we went live with our full infrastructure as code/zero touch environments. This is where everything is done through a CI/CD pipeline. This is where the need for observability as code came from — because if we're treating all our environments and applications as code, we need to treat our observability exactly the same. We believe we have a very good vehicle for software delivery through these three things, and today I'm going to talk about how we monitor that.
In 2018, when we started, we started insourcing and got quite a lot of developers. We had a centralized monitoring team that was looking after monitors, dashboards, and alerting.
This was fine to start with. But as we grew and added more developers, this team became a bottleneck because we would have to wait for the monitoring and alerting to be provisioned before our projects could go live. To add a further challenge to this — as we were on the DevOps journey and all about encouraging developers to own their code end-to-end — we had a separate production team that would take those alerts. So, we needed a way to implement monitoring and alerting to try to increase our efficiency.
When we started, the challenge was we had about 150+ developers. We have a lot more now, but to start with, we had about 150. We needed the solution to encourage ownership so that they could truly own their code end-to-end. And as a part of this, we also needed to give insight into production status at any moment in time. Finally, being an SRE team, the solution needed to follow SRE principles.
As I mentioned before, we had a very small team to implement this. By following SRE principles, the solution needed to be automated. We needed to enable the developers to have ownership end-to-end and remove that bottleneck — which meant that it had to be self-service. The solution needed to be simple, enable the devs to have control, and be repeatable and written as code.
To start, we had Microsoft Azure DevOps as our CI/CD. All our repositories, releases, teams, and users were stored in Azure DevOps. We’d just started using PagerDuty. At the time, it was a lot of manual configuration, but we’d only just started the SRE team and only just started using the tool.
The same can be said for our monitoring tool, which was Datadog. When we had the challenge and looked at the toolsets we had, we were very fortunate that they both had a Terraform provider that we could then use and implement. Finally — just to note — all our actual infrastructure and applications were running on Amazon Web Services (AWS). All of these things combined gave us a very good starting point.
We wanted to start small. It's all about making the small changes as quickly as possible. To start, we wanted to do synthetic API tests. These were our first choice because they were the easiest to do — because all we needed was the endpoint URL for us to monitor and understand the response code. We just had to build the monitors.
Once we did this, we moved forward and accelerated into building a 50,000-foot view of production. This is essentially our incoming traffic — anything through our CDN, through the frontend, the backend layers, and also into any of the other content stuff that we had in the background. That was great because it meant if there were any issues — at any moment in time — we could look in one place and identify exactly where the challenge was.
Having a 50,000-foot view is great, but we also needed to understand specifically what was going on with each service. For this, and for the 50,000-foot view, we followed the RED way of monitoring services: rate, errors, duration — or latency. We're looking at the number of requests coming in, the error count, and the duration of those requests. The idea is that by having these three things, you have a good indication of how your service is performing.
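As an illustration, RED monitors like these can be declared through the Datadog Terraform provider. This is a minimal sketch, not Vodafone's actual configuration: the service name "web-frontend", the APM metric names, and every threshold are hypothetical examples.

```hcl
# Sketch of RED monitors as code. Service name, metric names, and
# thresholds are all hypothetical.
resource "datadog_monitor" "request_rate" {
  name    = "web-frontend - request rate drop"
  type    = "metric alert"
  message = "Request rate dropped sharply on web-frontend."
  # R: rate — requests per 5 minutes falling below a floor
  query   = "sum(last_5m):sum:trace.http.request.hits{service:web-frontend}.as_count() < 100"
}

resource "datadog_monitor" "error_count" {
  name    = "web-frontend - error count"
  type    = "metric alert"
  message = "Errors are elevated on web-frontend."
  # E: errors — error count over the last 5 minutes
  query   = "sum(last_5m):sum:trace.http.request.errors{service:web-frontend}.as_count() > 50"
}

resource "datadog_monitor" "duration_p95" {
  name    = "web-frontend - p95 latency"
  type    = "metric alert"
  message = "p95 latency is high on web-frontend."
  # D: duration — p95 latency in seconds
  query   = "avg(last_5m):p95:trace.http.request.duration{service:web-frontend} > 2"
}
```

One monitor per RED signal per service gives each page a clear meaning: too few requests, too many errors, or too slow.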
We had the service-specific dashboards, but we also needed to have monitors because we don't want people to sit there watching screens to see what's changed on the dashboard. We need it to actively call them out. As a part of creating those monitors, we also needed to place all the development teams on call so that if there was an issue they wouldn't have to look at a screen. They would get paged straight away and be ready and available to fix the issue. We went away, and we built that, and we learned a few lessons along the way.
On the left-hand side of the screen, you can see I've got some code here. We were using Terraform 0.11 at this point, so if you see count.index, that's why. Nevertheless, it's still a very good example. You can see we have a synthetic test being created, and we have a variable called endpoints. Essentially, this is just the monitor to see: is this website up? So are vodafone.co.uk, vodafone-trade-in, or register-your-interest up and available? With that, we've also got name — the three names of those synthetics.
Because these weren't produced in variable blocks — although we had a module resource to call — it became a nightmare to manage. Although it's just two variables here, there are over ten in the example on the left-hand side, which meant we were managing multiple parallel lists of variables — an absolute nightmare.
The way out of this was to build modules with variable blocks. I've changed that same code you just saw to this — the same thing, but expressed as variable blocks. It's not as pretty to look at, but the variables turn into these blocks, which makes them much more human-readable — and much easier to manage.
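To make the contrast concrete, here is a sketch of the two variable styles in modern Terraform syntax (0.11 expressed defaults slightly differently). The URL paths beyond vodafone.co.uk are illustrative.

```hcl
# Parallel-list style: entries are paired by position, so every list must be
# kept in sync by hand — with ten-plus lists this becomes unmanageable.
variable "endpoints" {
  default = ["https://www.vodafone.co.uk", "https://www.vodafone.co.uk/trade-in"]
}
variable "names" {
  default = ["Homepage", "Trade-In"]
}

# Variable-block (map) style: each synthetic's settings live together under
# one key, which is far easier to read and change.
variable "synthetics" {
  default = {
    homepage = {
      name = "Homepage"
      url  = "https://www.vodafone.co.uk"
    }
    trade_in = {
      name = "Trade-In"
      url  = "https://www.vodafone.co.uk/trade-in"   # illustrative path
    }
  }
}
```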
We're using Terraform 0.11. Here on the left-hand side, we have a module or a resource to build a synthetic API test in Datadog. As you can see, the way count.index worked with Terraform 0.11 was that we would assign a count. We would look into a variable and count how many times it appeared.
In this case, we're looking at the variable endpoints and counting to see, "I need to provision three synthetics — or three monitors." Monitor one would be vodafone.co.uk, monitor two would be trade-in, and monitor three would be register-your-interest.
I would then have a name associated with each of those, which you can see here. And it's the same thing: vodafone.co.uk would be called Homepage, trade-in would be called Trade-In, and register-your-interest would be called Register Your Interest.
The challenge was that while this works, the difficulty came when you wanted to change anything. Terraform associated each monitor with a numeric position in its state, and those positions didn't change. For instance, if we decided we no longer wanted trade-in, or its URL changed, and we didn't put the replacement in exactly the same position, Terraform would run and say, "I need to delete this monitor, which is the trade-in one. We'll delete this URL and replace it with the next one, which is Register Your Interest." This meant all the history on that synthetic monitor was now looking at something else, and we could no longer trust what we were doing. This was quite a big cause of toil for us before Terraform 0.12 came out.
Traditionally, I would have said we want to be on the most mature, stable, tested version. But by upgrading to 0.12, we were able to use for_each loops, which got us away from the challenge of numerical indexes being assigned. Instead, there's a dictionary key — the name of the synthetic — which meant the issue would never happen again. Those are two key lessons.
We had Terraform calling a Python script. That Python script would go into Azure DevOps. I mentioned earlier that we had our CI/CD there, but all of our teams, users, and the services they owned were also stored there. It would make an API call to Azure DevOps and pull out all that information.
Next, the Python script would also go to AWS and pull all of our running Amazon ECS tasks, so then we had an indication of what was running in that given environment. Next, all that data would be input into Terraform. It would use the data from Azure DevOps, with the teams, users, and services to provision teams, users, and escalation policies within PagerDuty. So straight away, it gave us a vehicle to put all our developers on-call, but in a completely automated fashion with nothing being done manually.
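The PagerDuty side of this can be sketched as follows, assuming the Python script has already flattened Azure DevOps teams and members into a Terraform variable. The variable shape, team names, emails, and escalation settings here are all hypothetical.

```hcl
# Hypothetical shape of the data the Python script emits after querying
# Azure DevOps for teams and their members.
variable "teams" {
  default = {
    "web-frontend" = { members = ["dev1@vodafone.example", "dev2@vodafone.example"] }
    "checkout"     = { members = ["dev3@vodafone.example"] }
  }
}

# One PagerDuty user per unique member across all teams
resource "pagerduty_user" "dev" {
  for_each = toset(flatten([for t in var.teams : t.members]))
  name     = each.value
  email    = each.value
}

resource "pagerduty_team" "team" {
  for_each = var.teams
  name     = each.key
}

# A simple per-team escalation policy: page the first member, loop twice
resource "pagerduty_escalation_policy" "policy" {
  for_each  = var.teams
  name      = "${each.key} escalation"
  num_loops = 2
  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.dev[each.value.members[0]].id
    }
  }
}
```

Because everything is keyed off the team data, a developer joining or leaving a team in Azure DevOps automatically flows through to PagerDuty on the next Terraform run.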
Taking all the information from AWS and ECS, we then fed that into Datadog through Terraform. It enabled us to provision monitors, dashboards, and eventually, things like — as we saw — synthetics, APM, etc. This was great. Originally when we achieved this, we needed to train the developers to be on-call and make sure they were comfortable with everything we built.
It was a blessing in disguise that we did that because it meant that we had to sit with the solution calling us out for a couple of weeks. And we found out our state file had become huge. We were provisioning over 150 developers, 100 services, monitors, dashboards, different kinds of performance metrics that we were doing. It would take us 17 minutes to run Terraform.
That wasn't ideal because it could significantly delay the rate at which we could deliver. It ties in quite nicely to lesson three, which is to split your state. We split our state file into the PagerDuty users, dashboards, API tests, and monitors. This meant Terraform could run much faster, we'd increase the rate at which we could deliver, and we wouldn't be a bottleneck. That was how we provisioned total visibility across our entire estate. However, that was just calling us out.
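Splitting state along those lines looks roughly like this: each slice lives in its own root directory with its own backend key, so each can plan and apply independently against a small state file. The bucket and key names are hypothetical.

```hcl
# monitors/backend.tf — state for Datadog monitors only
terraform {
  backend "s3" {
    bucket = "vf-oac-terraform-state"             # hypothetical bucket
    key    = "observability/monitors.tfstate"
    region = "eu-west-1"
  }
}
```

```hcl
# pagerduty-users/backend.tf — state for PagerDuty users and teams only
terraform {
  backend "s3" {
    bucket = "vf-oac-terraform-state"
    key    = "observability/pagerduty-users.tfstate"
    region = "eu-west-1"
  }
}
```

With four small states instead of one large one, a change to a dashboard no longer has to refresh every user, monitor, and synthetic before it can apply.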
We went forward with this by asking how do we share it with everyone? The way it worked is that we, as the SRE team, were developing Terraform modules for PagerDuty and Datadog. The idea is we would run Terraform and it would provision those services. But the real beauty was that the developers could call those modules themselves.
You can see a little snippet there on the right-hand side, but all we're doing there is creating a synthetic, a duration monitor, and a PagerDuty schedule, all through code. Everything you see being declared is just the variables, because the modules are already built; they're just pulling down those modules from our S3 bucket, running Terraform, and inputting those variables.
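A development team's call into those shared modules might look like this sketch. The S3 bucket, module paths, and variable names are all hypothetical; the point is that a developer writes only variables, never the underlying Datadog or PagerDuty resources.

```hcl
# Hypothetical module sources in the SRE team's shared S3 bucket
module "checkout_synthetic" {
  source = "s3::https://s3-eu-west-1.amazonaws.com/sre-modules/synthetic-api-test.zip"
  name   = "Checkout API"
  url    = "https://www.vodafone.co.uk/api/checkout/health"   # illustrative endpoint
}

module "checkout_duration_monitor" {
  source       = "s3::https://s3-eu-west-1.amazonaws.com/sre-modules/duration-monitor.zip"
  service      = "checkout"
  threshold_ms = 2000
}

module "checkout_oncall" {
  source   = "s3::https://s3-eu-west-1.amazonaws.com/sre-modules/pagerduty-schedule.zip"
  team     = "checkout"
  rotation = ["dev1@vodafone.example", "dev2@vodafone.example"]
}
```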
Originally when this started, the idea — and the vehicle to deliver new enhancements and requests or features — was they would come to us directly as the SRE team. What happened next surprised all of us, but it was absolutely something that was welcomed.
We saw that if there were areas where we hadn't delivered something — or they were using a new technology or language — developers started submitting PRs to us. They would submit PRs to build the modules; we would work with them, approve them, and put them in that S3 bucket, and then straight away any other developer could call them. We made it self-service and also made it easy for them to contribute and build on it.
That was the original solution we built. It was absolutely great and game-changing for us. It removed the bottleneck and challenge — and enabled the developers to develop as fast as they wanted to go. We managed to hit what we wanted to do — we achieved our objective — but we didn't stop there.
We wanted API tests, 50,000-foot view, RED dashboards, and placing development teams on-call. It had to be self-service. It had to be automated — and we achieved all of that.
We've continued on with synthetic browser tests. We're now able to programmatically create bots that conduct customer user journeys throughout the website. This helps us because we can use it as a release blocker, which means you must have a passing synthetic test to deploy to the next environment.
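A browser synthetic defined as code might look like this sketch. The journey, URL, and device ID are illustrative, and the recorded browser_step blocks are omitted for brevity.

```hcl
# Sketch of a browser synthetic that drives a customer journey.
resource "datadog_synthetics_test" "login_journey" {
  name       = "Customer login journey"
  type       = "browser"
  status     = "live"
  locations  = ["aws:eu-west-1"]
  device_ids = ["chrome.laptop_large"]
  request_definition {
    method = "GET"
    url    = "https://www.vodafone.co.uk"
  }
  # browser_step blocks would script the journey here: open the login page,
  # enter test credentials, assert the account dashboard renders.
}
```

The release pipeline can then check the test's latest result and refuse to promote a build to the next environment while the journey is failing.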
This became very valuable when we started doing destroy-and-deploy environments. If you wanted a new performance environment, you could spin it up. But we would validate the health by running every customer journey. Having this through Terraform meant that we could run it straight away without having to program anything manually.
Finally, by implementing Datadog and collecting as many metrics as possible, we could generate SLOs off everything we did. But even better — taking that one step further — we were then able to automatically create those SLOs through Terraform — again. But it came to the point where we are now running Terraform at every release, and every single development team has their state stored in S3. This has been game-changing for us, and it's enabled us to start treating all environments the same.
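An automatically created SLO can be sketched as a monitor-based Datadog SLO like this; the monitor ID and target are hypothetical.

```hcl
# Sketch of a monitor-based SLO generated as code.
resource "datadog_service_level_objective" "availability" {
  name        = "Digital availability"
  type        = "monitor"
  monitor_ids = [1234567]   # hypothetical: the service's error-rate monitor
  thresholds {
    timeframe = "30d"
    target    = 99.9
  }
}
```

Because the monitors already exist in state, generating an SLO per service is just another for_each over the same data, which is what makes automating SLOs at every release practical.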
We've found we now have a time-to-production of less than four hours. This keeps reducing with every week and month that goes past, but the last time we took this cut, four hours was our time-to-production.
We have our best ever digital availability of 99.9%. That's due to the level of insight that we've had and the capability to ensure that the right people are looking at the incident in the shortest space of time. This is then reflected by our MTTA — or Mean Time To Acknowledge an Incident. When an issue happens, a call goes out to a developer. How quickly do we acknowledge that? That is currently, on average, less than a minute.
Then we have MTTR — our Mean Time To Repair. When we have an issue, this is how long before we know the incident is over and services are restored. This is now an average of 30 minutes. We didn't have these metrics before, but hopefully, you can appreciate these are things we're very proud of. I feel this has brought a significant benefit to our developers, our business, and our customers.
From a business perspective, that's great, but what about the developers? We've enabled developers to have self-serve monitoring and alerting. They can develop whatever they want, whenever they want, safely, which will mean that even if monitoring and alerting was forgotten about, it would be automatically applied because we're running Terraform at every single deployment.
Developers have full ownership of their code, and they also have automated insights through the software delivery lifecycle. What does this mean? Well, let's say we hit an issue in a dev environment. If they want to, they can build a monitor to make sure they can capture that if it happens ever again.
But that monitor wouldn't have to be in the dev environment. They could also run that for the performance environment, SIT (system integration testing) environments, and also production. So straight away, they've had an issue, they've hit it, but they can be made aware of it if it happens anywhere else throughout the development lifecycle.
Finally, what about SREs? The benefit to our SREs, and to the teams, is that using Terraform has enabled us to standardize across our environments. This can be seen more recently: I started looking after our performance testing, and as a part of that, we needed to put the same monitoring and alerting we had in production into our performance environment.
That took 17 minutes. And all that was, was just running Terraform across all of the different services to implement that. Next, with SREs, we've also got automated SLOs. SRE is all about our SLOs and SLIs, but now we've got them automated. Even if we have a new service going in, we can automatically see the SLI and then configure and work with the product owners to make those SLOs.
Finally, there's a big toil reduction. This is the time that we spend doing repetitive tasks. Having monitoring and alerting defined as code means that once we've done it, we never have to do it again. Through all of this, we now have all of our infrastructure defined as code, all of our applications defined as code, and all of our observability or monitoring defined as code. And through this, I hope you can see, there's been a significant benefit. Thank you for your time.