Hear about Vodafone’s journey building a telco-as-a-service platform using Terraform, Vault, and Consul.
It's been more than 25 years now. I look a lot younger, I think, but I pretend anyway. I’ve been working with cloud for the last 11 years, give or take. But I think it’s to my advantage that those early years were spent in the datacenter or the office, because I know the difference, and the progression, between the on-prem world and the cloud world.
I've been with Vodafone for the last five and a half years. I joined Vodafone UK in 2018, and Vodafone Group at the end of 2020. I'm part of the IT4IT Platform Engineering team in Vodafone Group. We offer shared services and platforms to the wider organization that support the rest of our business.
You may have understood by now that I'm passionate about cloud and cloud best practices. I'm also passionate about cloud cost optimization, which is probably a topic that's close to your hearts as well. Outside my time in the office, I run an AWS user group, so I go above and beyond when it comes to cloud.
I’ve spoken about me, so let's talk about the telco-as-a-service platform that we built at Vodafone. The platform started with a vision. We run technology cycles in Vodafone, and at the beginning of the current technology cycle, a group of people came together. They decided that it was time to build a platform that put the developer at the center of our attention.
We work in a very competitive sector. There are quite a few telco operators in this country, and in the other countries we operate in, that fight for the smallest niche of customers. So, lowering time to market and getting features out quickly and effectively can be the differentiating factor that wins or loses the revenue we strive for.
The idea — the vision behind this platform — was to abstract complexity. Developers or software engineers often don't want to know about the underlying infrastructure; it's not their job, it's not where their value is best spent.
Therefore, the group that envisioned this platform came up with this idea of creating a layered approach where a software engineering team would just drop their application on this platform — that the platform team would build and run. The platform would need to have a way to run opinionated templates where security is at the heart of our approach. So, a pre-approved template that satisfies security requirements and that enables our teams to get up and running quicker.
Another advantage of running a centralized platform is we can reduce the TCO by leveraging economies of scale. We’ve run into situations where different teams use the same tools and pay different providers — different local sellers — for the same tool multiple times. We were trying to reduce this.
As an architect, I go from the vision to the requirements. These are some of the requirements that came out of the vision. First, we wanted to be up to speed with current technology trends: a very sharp focus on containers, everything as code, automation, and API-first. All the bells and whistles that you already know about.
So bringing Vodafone up to speed with current technologies, but also integrating this platform with our code repositories. For example, we don't run our code repositories in the public cloud because of regulations. So how do we integrate our on-prem code repositories with this platform that is supposed to be the easiest thing to adopt by our software engineering teams?
The same goes for observability. We manage and deliver shared solutions. So, we are responsible for the observability platform. How do we integrate that with our platform so that an application that is delivered on this platform is automatically integrated with our observability tools? I will speak a bit more about code reusability later. There are a couple of slides that go into that.
Also crucial to the success of this platform is an approach that can support multi-cloud and hybrid cloud. We work in a highly regulated sector. We work in multiple countries with their own regulations; therefore, one approach doesn't fit all.
Consider, for example, a country like Germany where you have one AWS region. If you want to have HA that goes multi-region, you cannot limit yourself to AWS because the data needs to stay in country; in Germany. So, you need this multi-cloud approach. Other countries we operate in don't even have a cloud region and require data to stay in the country. So, we had to consider on-prem deployments for this platform as well.
There's an element of integration with SaaS products. We'll dive deeper into that in a few slides as well. And the last couple of points, reusability and extensibility, go back to the element of Vodafone operating in multiple countries and having to adapt quickly to different requirements and local regulations. I will explain later how we took this on board as well.
Then from the vision and the requirements, we also have some key technology principles that we follow. These are not limited to this platform but were definitely relevant to the delivery of this platform in particular. I mention everything as code, and immutable infrastructure. Again, they shouldn't come as a surprise to any of you here. They’re — if you want — buzzwords, but they are also a mantra that we follow when we think about cloud engineering.
You don't want to manage anything manually. You don't want to have anything that's delivered one-off through a dashboard. You want everything to be codified and your infrastructure to be dispensable. You need to be able to erase your platform and recreate it from code. That's the ultimate goal anyway.
That's the point that I mentioned earlier. If you spend time recreating the same solution in a slightly different way, again and again, that's not a good use of your expensive infrastructure engineers or platform engineers. And we have that — it's a reality for almost all companies.
We have teams that deliver similar features in their unique way. They create their own snowflakes. They need to be individually security assessed; they need to be individually managed, etc. We wanted to move away from that. With regard to code reusability, I'll touch on that a bit later, but you'll see that Terraform was obviously crucial for us to enable that.
Secure by design is a mantra that a lot of people use. We adopted it so deeply that a part of our security team is called Secure by Design, which means we move security to the left as much as possible.
We involve our security architects and consultants in the design phase: not when the design is ready for approval, but when it is still being discussed at the inception phase, so that we get feedback from them. We can steer our design in the right direction and have fewer surprises when approving that design to go into production.
You'll have heard about build vs buy. Whoever says that phrase out loud almost implies that one has an advantage over the other. Surely, in some scenarios, one of them is better than the other. But neither is a one-size-fits-all approach. In our case, if you have a competitive advantage by doing things yourself, then that justifies self-managed.
If you don't have a competitive advantage, then you’re wasting time. Your engineering resources are expensive; why would you have them learning about the technology, deploying it, managing it, and handling incidents in production? That doesn't give you a competitive advantage.
If it does, then well, you may even attract those technical resources that your company needs by advertising that you're using a specific technology. But otherwise, there's no reason to not adopt a SaaS product.
One example for us was a non-relational database. We could have downloaded the software, deployed, managed, scaled. It would've taken us weeks, maybe months, of planning to decide the scale of the database cluster, etc. Or we could just integrate with the SaaS product that the same vendor offered, and we would pay as we go. Again, one of the big advantages of the cloud.
A few of the things I've mentioned so far: We have already indicated that we wanted to remove undifferentiated heavy lifting from our teams. Highly skilled engineers are an expensive resource. I said that a few times already. But maximizing the time your engineers spend delivering value is how your company can generate value itself.
In our case, we wanted to focus on the strengths of our teams. If we could have one platform that could support multiple teams, why split that across multiple individual teams? And we're not perfect; we're not there yet. There are still teams that recreate the same things over and over again. But this was going in the right direction of maximizing the value that one platform team can bring to the whole company.
I mentioned the SaaS integration. I wanted to spend a few more words on that. When we evaluate the total cost of ownership for an integration with a SaaS product, often, we don't compare apples to apples. Often we compare the cost of SaaS versus the cost of downloading a piece of software and running it on a piece of infrastructure.
Often, we forget the cost in headcount, the cost in resources, and the cost of the time you spend recreating something, which detracts from generating actual value for your customers. That's something I'm quite passionate about. The total cost of ownership is often not just the ticket price of what you're spending on a product. It's also the opportunity that you are creating by freeing some of your resources to do something different.
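As a rough illustration of that apples-to-apples comparison, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the function name, the parameters, and above all the numbers, which are placeholders and not Vodafone figures. It only shows the shape of the calculation, where the engineering time an option consumes is counted alongside the visible bill.

```python
def total_cost_of_ownership(direct_cost, engineers, cost_per_engineer, time_fraction):
    """Yearly cost: the visible bill plus the share of engineering time the option consumes.

    All parameters are hypothetical inputs for illustration.
    """
    return direct_cost + engineers * cost_per_engineer * time_fraction


# Placeholder figures only, chosen to illustrate the comparison.
self_managed = total_cost_of_ownership(
    direct_cost=50_000,         # infrastructure and licences
    engineers=3,
    cost_per_engineer=100_000,
    time_fraction=0.5,          # half of their time spent running the software
)
saas = total_cost_of_ownership(
    direct_cost=120_000,        # pay-as-you-go subscription
    engineers=3,
    cost_per_engineer=100_000,
    time_fraction=0.1,          # integration and vendor management only
)

print(f"self-managed: {self_managed:,.0f}")
print(f"SaaS:         {saas:,.0f}")
print("SaaS is cheaper overall" if saas < self_managed else "self-managed is cheaper overall")
```

With these made-up inputs the SaaS bill is the higher of the two, yet the total is lower once engineering time is included. Different inputs can easily reverse the result, which is the point of doing the full comparison.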
One of the outcomes of our analysis of how to remove undifferentiated heavy lifting is we created a strong relationship with HashiCorp in a few areas. That will come up soon.
You'll be slightly disappointed. I cannot reveal the secret sauce, so I cannot go in-depth into that diagram. It's small on purpose. It's one of my early brain children from when I was designing this architecture. But that should tell you that this infrastructure, this platform, is complex. It's not a simple three-tier application that you run on your cloud, deploy a simple template, and off you go.
It's something that's quite complex. One element that becomes crucial in this is that secure by design approach we follow in everything we do — so involving security as early as possible in the design. I was actually very lucky; I was working with a very open-minded security consultant. He was very happy to provide suggestions from his point of view. So, he facilitated my job incredibly, and when he came to approve this design, it was almost pre-approved because we worked together on that.
I hinted at how we took on the needs of different countries earlier. We went about creating and delivering this platform by following a data-driven approach. In this case, data was in the form of talking to our prospective tenants, identifying an anchor tenant — so a tenant that would be the first to adopt the platform with quite vast requirements themselves — and working with them on which features to deliver first.
We adopted an MVP approach. We delivered only the bare minimum that tenant required. But after that, we also opened the conversation to prospective tenants that would come with their requirements. We would adopt this approach of asking them: where do you see value in this platform? What would convince you to get on board with this platform? What do we need to deliver to make this platform attractive to you?
The last point on this slide is about the shared responsibility model. If you're using AWS — other clouds will have the same — you are already familiar with what shared responsibility means. You are responsible for one part of security, data management, etc. and your vendor is responsible for the remaining part. Usually the physical security, etc.
In our case, it was similar. Our platform team would be responsible for running and managing the platform, scaling it, making sure that enough resources are available for any workloads scheduled by a tenant, etc. But the tenants themselves, and their engineering teams, are responsible for running, managing, and supporting their application all the way to production. You know the model: "It worked on my laptop, it worked on Dev, now it's your problem." We didn't want that. We didn't want to shift responsibility from developers to operations when it comes to managing an incident in production.
We've spoken about what drove us and what was supporting us. Then how did we build the platform itself?
We have been part of an agile transformation at least since I joined the company, and probably since earlier than that. You can say we've been adopting agile methodologies in Vodafone for about five or six years. Agile, however, doesn't work the same in every company. Vodafone is a huge company with hundreds of thousands of employees.
We decided to adopt SAFe, the Scaled Agile Framework. This project was the first project in our department running SAFe. After that, with the lessons learned from the delivery of this platform, we extended SAFe to other projects.
Agile is well known. I'm not going to spend more time on how agile works, but rather on what it delivered for us, especially in a SAFe model: involving product, security, and platform engineering all together to deliver one common goal. That created some synergies between teams that we really valued, and that created opportunities for us beyond the delivery of the task.
It also gave us the opportunity to have some well-defined goals while still being able to pivot if the goal that was initially defined was not the right choice anymore. Again, data-driven. We would deliver, test, decide, and then pivot or continue on that trajectory.
I've mentioned our tenants were key to defining how the platform would evolve. The last thing I want to mention is being agile, but also needing to make this process as open and as strong as possible. I guess "strong" is probably not the right word, but you’ll get the gist in a second. We wanted to have our teams justifying their choices without being stifled in their delivery and innovation. And, at the same time, we wanted these choices to be documented for everyone to see and analyze in the future.
We did this through TDAs, which were sessions where we would bring product, security, software engineering, and platform engineering together, and all the major decisions would be discussed there.
The proponent of the decision would bring a decision paper. Anyone in the virtual room would be able to comment, to raise questions or objections. At the end of the process, after a public discussion, the decision would be accepted or rejected. This allowed us to have an open process, which then also created some documentation. That was crucial for teams that would inherit the management of this platform at a later stage, so they could understand why we chose, say, Consul over a different product.
I dropped a couple of HashiCorp names so far, but where was the value in the HashiCorp partnership? And I want to underline the word partnership because we have many vendors in Vodafone. Personally, I think we don't have a lot of partners, and I think the key there is the kind of relationship you create with the company or person at the other end of the line when you need some help or support or when something is wrong.
With HashiCorp, we found we were very well supported and could build a relationship of trust. HashiCorp was very open in admitting when their product had some bugs that needed fixing, and they were very quick in fixing those bugs for us — prioritizing the work they needed to do to get some stuff to production-ready.
At the same time, trust only goes so far. We wouldn't be partnering with HashiCorp if there weren't tools we see value in. You can see a couple of names there. Consul in particular (we use Consul Enterprise in our deployments) and, for encryption and secrets management, Vault. Again, Vault Enterprise. In our case, in a highly regulated sector, it follows that we're using products that we self-manage. Do we see a competitive advantage in managing those? Maybe not, but regulations trump competitive advantage, so we had to make that choice.
Obviously, Terraform is one tool that anybody serious about infrastructure as code uses. Our global account manager told me that we at Vodafone downloaded Terraform Community about 10,000 times in one calendar year — which is a fraction of the number of employees. But you can imagine the volume of downloads we had.
We're not limiting our partnership with HashiCorp to these tools; we are exploring a few more. I'll have a slide at the end of the presentation discussing a bit more where we're going with HashiCorp — which actually is about now. I didn't realize I'm at the end of my deck, and it's about on time as well.
Is this platform the finished product? Have we achieved our nirvana? And are we satisfied our job is done? Not quite. We learned a few good lessons, but we also have a lot more work to do.
I probably skipped the part where I mentioned that multi-cloud, hybrid cloud, and on-prem are key. That triggered a piece of work for us: how do we secure cloud-to-cloud or cloud-to-on-prem networking in a way that is not only consistent and up to the standard required by the regulators, but also enables us to be more agile in the way we manage our applications and our services?
So, in this work on zero trust networks, we are exploring different areas where you may have machine-to-machine or human-to-machine access. As you probably already know from the Azure presentation, we're playing around with Vault and Consul in the machine-to-machine space. We're evaluating Boundary in the human-to-machine space.
We are also looking into identity-based access controls. Moving away from that IP-to-IP — something that Nick mentioned earlier and something that Armon mentioned in his initial presentation: We don't want to be aware of IP A talking to IP B. That doesn't give us any value.
We may have IPs being reassigned, and the rules for them may still be stored on our firewall. We want to know which service can talk to which other service, and do that dynamically. If tomorrow an engineering team delivers a new service, and they can see from a catalog that a service already exists that does part of the job, they need to be able to plug into it with as little effort as possible. That's where we're going with this IT network modernization project.
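To make the contrast with IP-to-IP rules concrete, here is a minimal sketch in Python of rules keyed on service identity. It is an illustration only, not the Consul API and not how our platform implements it: the `ServiceRule` type, the `is_allowed` function, the service names, the wildcard convention, and the "most specific rule wins, deny by default" behaviour are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRule:
    """Hypothetical rule: may `source` call `destination`? "*" matches any service."""
    source: str
    destination: str
    allow: bool


def is_allowed(rules, source, destination, default_allow=False):
    """Decide from service names alone; no IP address is involved.

    The most specific matching rule wins (an exact name beats "*").
    With no matching rule, fall back to `default_allow` (deny by default).
    """
    best_rule, best_score = None, -1
    for rule in rules:
        if rule.source not in (source, "*"):
            continue
        if rule.destination not in (destination, "*"):
            continue
        # An exact destination counts more than an exact source.
        score = 2 * (rule.destination != "*") + (rule.source != "*")
        if score > best_score:
            best_rule, best_score = rule, score
    return default_allow if best_rule is None else best_rule.allow


# Hypothetical service names.
rules = [
    ServiceRule("*", "billing", allow=False),        # nobody calls billing...
    ServiceRule("checkout", "billing", allow=True),  # ...except checkout
]

print(is_allowed(rules, "checkout", "billing"))   # exact rule matches
print(is_allowed(rules, "reporting", "billing"))  # only the wildcard rule matches
print(is_allowed(rules, "checkout", "catalog"))   # no rule matches
```

Because the rules name services rather than addresses, reassigning an IP does not invalidate them, and a new service only needs a rule naming it to plug into an existing one.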
We are expanding the scope for Vault to additional use cases. Some of these are already in the production stage, and some of these are in the works. Vault has been selected as one of the products that are key to protecting our core networks — the network your mobile uses to make a phone call, for example.
We are also trying to modernize our identity management tools, moving away from the process that Nick mentioned, where you go through multiple stages of approval just to get a user into a user group so they have permission to do something somewhere in the cloud or on-prem. We want to modernize that. We want to make that process as simple and streamlined as possible. So we are exploring what can be done with Vault in that space.
We have an IoT offering in Vodafone. I admit I'm not involved with that piece of work, but I'm aware that they use Vault heavily and have it in production already.
Last but not least, we have teams evaluating the advantages and opportunities offered by Terraform Cloud and Terraform Enterprise. I mentioned 10,000 downloads of Terraform Community. We can probably do better or probably remove some of the heavy lifting there as well. Is that coming at the right price? That's what we are evaluating now. Can we gain a competitive advantage by going down the cloud route or enterprise route versus the self-managed route?
This leverages Terraform modules where we offer a unified repository for infrastructure as code modules that serve as blueprints for all the teams to then adopt. It's not mandatory; the repo is internal and is security-approved. The templates themselves will be security-approved. A team that wants to deploy infrastructure in a certain way can make use of those templates, reducing their time to delivery.
But they're also free to take those modules, make some changes and modifications, and then either contribute back to the OneSource repository or just keep them locally. Whether Terraform Cloud will give us more visibility into how these modules are used is something we're also evaluating.
That's the end of my presentation. Thank you very much for listening to me for the last half hour. I'll be around for lunch if you want to ask any questions.