Find out how TomTom adopted Vault, integrated it with Kubernetes, and built a large amount of audit and observability utilities, mainly with Prometheus.
Hear about the process of adopting HashiCorp Vault Enterprise at TomTom and what some of the challenges were. This talk will take you through lessons learned around auditing and telemetry, as well as adoption and integration into GitOps and Kubernetes workflows.
Hi, my name is Neven, and I'm going to show you how and why we introduced HashiCorp Vault as a secrets management solution and what challenges we encountered. But before that, I need to ask something of you. I promised some people that they would see me on this stage, but unfortunately they couldn't be here, so please smile.
A little bit about me: I started my journey as a Linux system administrator in an ancient time of black and green consoles and terminals. The next step in the evolution of that journey was moving into cloud. After caring for large cloud deployments, I started to care more about security, so I moved into security for a while. Realizing that security was really a lot of paperwork and not so much hacking, I moved back into Linux system administration and then into something called DevOps. And as I had some experience with security — DevSecOps.
As I'm passionate about new technologies and R&D, I joined TomTom three years ago, and I'm working there as a senior application engineer and a YAML programmer, leading the secrets management team. If you want to contact me after this presentation, these are my handles on LinkedIn and Twitter.
If you're not familiar with TomTom, we are map makers. We are providing geolocation technology for drivers, car makers, enterprises, and developers. Our highly accurate maps, navigation software, real-time traffic information, and APIs enable smart mobility on a global scale, making the roads safer, driving easier, and the air cleaner.
So this is the agenda for this presentation. I will talk a little bit about secrets management. Then I will show you HashiCorp Vault's timeline in TomTom. I will show you how HashiCorp Vault was embraced in the developer experience in TomTom. You will learn some of our use cases in Kubernetes, some extra tools we developed to help us with that experience — and then I will give you some takeaways at the end, if we still have time.
I was told that every good presentation starts with a story, so let's start with a story. This is my spider. It's a Brazilian Black Tarantula. As I was really afraid of spiders, I thought that getting one was a good idea, so I got this spider, and I got the terrarium for it. At that point of my life, I was working as a Linux system administrator that was also in charge of security for the company.
I had access to company keys, passwords, and I also needed to keep them safe. I stored them on a USB flash drive for safe keeping. And thinking about the best place where to store that USB drive, I saw a plastic zip bag on my table, and you can probably see where this story is going. The original idea was pretty straightforward, hiding my USB drive within the spider terrarium inside the zip bag. And that's also what I did.
I managed to recover those keys a few times successfully. The main problem here is that every time I needed a password or any credentials from that USB flash drive, I needed to open the terrarium, put my hand inside, fight my fear, fight the spider, recover the flash drive, put it in my laptop, extract the keys, and then do the whole process in reverse. And as someone who had trouble with passwords all my life — who had a lot of CSV files and notes, and what-not — I realized that this was not an optimal solution for this problem.
There was a solution for managing secrets a long time ago, and as many of you know, it was called ConnectionStrings. It was the only place where you needed to keep your password, because you had one server and one database that needed to connect — ConnectionStrings was the only place where you needed to hide your password.
But as your team grew, you got more and more developers. With more people, you needed to share those passwords. And what was the best way to share your passwords? Of course, email. That's the most popular way of sharing passwords, overshadowed now only by Slack as the new primary medium.
Back then, we also had these really cool password-sharing platforms. They worked in offline mode and online mode. And, talking about passwords, there were password CSVs on some shared drive or in some version of an online document management system. And we cannot forget about good old hardcoded passwords in Git repositories. Also, most of these passwords were used across all the environments that your company had.
Moving into the future of the apps now, they're running in multiple instances of microservices, communicating with each other, communicating with redundant databases — various frontends, various backends.
So, the question that came naturally is how to secure these credentials. This is where the concept of secrets management solutions comes into play. Secrets management is used to manage and protect your secrets and to deploy your applications without hardcoded passwords for your infrastructure. For infrastructure like this, it is really a necessity.
There is also no more messy code with various environment variables floating around. And in the case of TomTom, the requirements for a secrets management solution were pushed by developers and by management at the same time — which is really unusual. That also meant the secrets management solution needed to fulfill the requirements that were gathered.
The requirements were pretty straightforward. Auditors — the audit team in TomTom — requested us to have a single source of truth with an audit log so we could see who used the password, how, when — and we could have this whole trail of password usage in the whole company.
Developers wanted programmatic access to the solution, so an API interface was the main thing they needed. From the security team came another requirement: secrets had to be transmitted securely and encrypted at rest. From our SRE team, we got a requirement for a highly available solution running in multiple regions — and also for configuration and infrastructure as code.
Multiple authentication methods were also required, so that if one of them fails, we still have some way to access our secrets management solution. And secrets versioning, which helps a lot if you have teams that change their passwords but do not change them in their CI/CD pipelines — at some point, you need to recover the old one to help them.
Out of all the different vendors we tested at the time, we chose HashiCorp Vault as the primary one. I will quickly show you the implementation timeline of HashiCorp Vault in TomTom so you can understand how this process looks in the enterprise. This is our timeline, and if you go step-by-step, you'll see that the timeline for HashiCorp Vault is an interesting one.
The first ticket that we got about a secrets management solution goes way back to October of 2017. There was a need for a secrets management solution, and HashiCorp Vault open source was recommended at that time.
Only a year and a half later — which is a really short time if you have ever worked in an enterprise or a bank — we finally managed to do a POC of HashiCorp Vault, running on a small virtual machine on our OpenStack platform, with a few members of one team using the solution.
In 2019 — you remember, it was the year of Kubernetes; it was the new big thing in enterprise IT — so naturally, our next proof of concept needed to be around Kubernetes. At the same time, HashiCorp announced Helm charts for deploying Vault on Kubernetes. So we had another challenge: learning all these new technologies while also trying to build a successful secrets management platform.
This POC ran on Consul with the open source edition of Vault, and it ran for some time. After some general usage by a couple of teams, we decided to finally try Vault's enterprise features. We ran two proofs of concept of Vault Enterprise while keeping this one running, because teams were still using it.
The first proof of concept of Enterprise was in March of 2020, with an emphasis on exploring the different modules that HashiCorp Vault provided. As our trial license provided us access to all the modules, we went through them all and tried to find which modules fitted us best.
Then the second proof of concept was run, with an emphasis on comparing HashiCorp Vault Enterprise with similar secrets management solutions. And naturally, the two most promising features for us were the ability to do our own disaster recovery, and namespaces.
The next few months, we were writing documentation, documentation, and more documentation. By documentation, I don't mean Vault usage, but all the architecture and all the documentation that is needed to enroll the solution in the enterprise: Where this solution is going to connect, who is going to use it, how, all the moving parts — basically everything. We also had reviewers from our internal teams like SRE, cloud center of excellence, and security.
After all of that, we were finally ready to go to the MVP phase — a Minimum Viable Product, a HashiCorp Vault Enterprise system with only a couple of users. But that couple of users proved to be power users, and they provided much-needed feedback for us to finally move to LA, which is Limited Availability. HashiCorp Vault was then introduced to a limited number of teams that used the most basic features of Vault — just the KV store and namespaces.
A few months after that, after more testing, more writing documentation, more feedback, more usage, HashiCorp Vault was finally signed as a Developer Productivity Standard for secrets management within TomTom. As you can see, it takes some time for a product to go from this introduction phase to being considered the company standard.
One more thing: in April this year, we replaced Consul to simplify our backup and restoration process. That was the complete timeline of Vault's journey in TomTom. And as Vault was introduced as a secrets management solution, there were some expectations of this new tool.
In TomTom, we are trying to provide a great developer experience for all of our developers — which, if you really think about it, is user experience when your users are developers. And as in every puzzle set, you get all these little pieces that you need to put together to get a bigger picture.
The same is in the developer experience, where all of the tools that your company is using are these little pieces that you need to put together. And as HashiCorp Vault is a pretty big piece of the developer environment, we wanted our users to have a great experience interacting with this solution and using Vault as a secrets manager.
One of the most essential things that we do for developer experience is onboarding, and onboarding procedures for any solution we are using. HashiCorp Vault is a tool designed with DevOps culture in mind, and all the DevOps concepts apply to Vault — automation, CI/CD, infrastructure and configuration as code, and many more. We use all these concepts in our day-to-day work, from creating the initial Vault infrastructure to configuring Vault performance replication and disaster recovery replication, or just onboarding new users.
The idea for this onboarding process was really simple. We wanted our users to provide us with just two pieces of information to onboard onto Vault: the name of the new namespace they want and the name of their Active Directory group. They use our self-service portal to enter those two pieces of information — and after they click finish, they get an email with all the guidelines on how to connect to their own namespace.
What happens in the backend? Pipelines run Terraform. Terraform connects to Vault, configures the new namespace, configures all the policies, and provides admin access to the new team.
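As a rough sketch, the backend work boils down to a handful of Vault API calls that Terraform performs. The endpoint paths below follow Vault's documented API routes (namespaces are an Enterprise feature), but the policy content, the LDAP auth mount, and the group mapping are illustrative — not TomTom's actual configuration:

```python
def admin_policy() -> str:
    """Render a broad admin ACL policy for a freshly created namespace.
    The policy is written *inside* the namespace, so "*" is scoped to it."""
    return (
        'path "*" {\n'
        '  capabilities = ["create", "read", "update", "delete", "list", "sudo"]\n'
        '}'
    )

def onboarding_requests(namespace: str, ad_group: str) -> list:
    """Return the (endpoint, payload) pairs the pipeline would send to Vault."""
    return [
        # POST /v1/sys/namespaces/<name> creates the namespace (Vault Enterprise).
        (f"sys/namespaces/{namespace}", {}),
        # PUT /v1/<ns>/sys/policies/acl/admin writes the admin policy inside it.
        (f"{namespace}/sys/policies/acl/admin", {"policy": admin_policy()}),
        # Map the team's Active Directory group to that policy (assuming an
        # LDAP auth mount pointed at AD).
        (f"{namespace}/auth/ldap/groups/{ad_group}", {"policies": "admin"}),
    ]
```

In production this runs through Terraform's Vault provider rather than raw HTTP calls, so the namespace and its policies stay tracked as code.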
When we were preparing Vault for all these new users, we encountered a couple of major challenges as well.
And as you may know, driving the adoption of a new service is never an easy task. If you have ever played baseball in a cornfield, you will know that the saying, "If you build it, they will come," does not work in IT.
Many users didn't know about Vault, and they didn't understand the value of secrets management or what it means. So, at the beginning, we had a low initial number of onboarded users.
To overcome that, we created these monthly stakeholder meetings with our current users providing interesting use cases that some teams implemented. We also showed them the new features in Vault that were out. We also tried to provide good use cases that maybe they can use in the future.
This was effective, as we immediately noticed a rise in users and teams. And after every iteration of the stakeholder meeting, we announced it in all relevant Slack channels, and we noticed that we were getting more and more users.
People, in general, don't like change, and introducing a new solution to developers where they need to rewrite part of their code is never a nice option. In the absence of a primary secrets solution at that point, developers used third-party ones, they used cloud native ones, they wrote their own solutions — or they didn't use any solution at all. To help them, we created these detailed procedures for migration from their current solution to HashiCorp Vault.
Most users will automatically use all the default policies and default roles and keep putting secrets in their namespace. They will not use best practices because they don't even know how to do that.
To tackle that, we created something called an onboarding call, which we do with every team requesting a new namespace. There we try to explain to them how to use Vault in their respective programming languages, and we try to push best practices on them before they start using the solution.
These migrations were a big challenge for us at the time. In the end, we managed to convince most of the teams. But the main point here was to have a lot of discussions and a lot of meetings — trying to convince them that the central solution the whole company is using for secrets management is the best one for them as well.
One of the biggest challenges was documentation, because this is by far the most requested thing we get in our support channel. We try to offload as much as possible to HashiCorp's official Vault documentation, because that documentation is amazing — we didn't want to copy-paste the same documentation around.
Most of our documentation is just frequently asked questions that we get in our Slack channel, plus some specific TomTom use cases. We realized that most developers don't read the documentation until they stumble on a problem, so the onboarding call I already mentioned proved to be a good way to solve this.
When they are requesting a namespace, they will get all these examples in their respective programming languages. Also, with the documentation, we added a small Terraform sample for teams so they can use Terraform to configure and manage their namespace.
As in any other enterprise, Kubernetes is used on a large scale in TomTom. As most of you know, secrets in Kubernetes are encoded, not encrypted. We needed the solution to move from the static Kubernetes secrets to something more secure — and that, of course, is the integration with HashiCorp Vault.
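You can see the problem in one line of Python: recovering the plaintext of a Kubernetes Secret needs no key at all, because the `data` field is only base64-encoded:

```python
import base64

# A value exactly as it appears under `data:` in a Kubernetes Secret manifest.
encoded = "cGFzc3dvcmQxMjM="

# "Decrypting" it requires no key whatsoever -- it's encoding, not encryption.
plaintext = base64.b64decode(encoded).decode()
print(plaintext)  # prints "password123"
```

Anyone with read access to the Secret object (or to a backup of etcd) gets the plaintext for free, which is exactly why we wanted secrets served from Vault instead.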
So, one thing we really wanted to do was enroll these Kubernetes clusters that were Vault-ready. And by Vault-ready, I mean that when you enroll the clusters, in the end, you can consume secrets from Vault.
To achieve this, we created a workflow as part of our GitOps culture. Everything starts in the GitHub repo, where our Kubernetes reference architecture template resides. An Azure DevOps pipeline is then used to deploy resources with Terraform from that GitHub repo into the Azure cloud.
Terraform then deploys Kubernetes. It deploys and configures the container storage interface (CSI) provider or the Vault Agent, depending on what you choose. Terraform then gets outputs from Kubernetes — the JWT token, host, and certificate — and uses that output to configure the Kubernetes auth method in Vault, in the respective namespace.
And when all pipelines are completed, that Kubernetes cluster is ready to consume secrets — and our application is ready to use them. Everything is done with one workflow from an initial Kubernetes deployment to Vault configuration and secrets ingestion.
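As a rough sketch of that auth-configuration step: the endpoint and field names below follow Vault's documented Kubernetes auth method, but the role values are illustrative, not our actual configuration. The payloads are built from exactly the three Terraform outputs mentioned above:

```python
def kubernetes_auth_config(host: str, ca_cert: str, reviewer_jwt: str) -> dict:
    """Payload for POST /v1/<namespace>/auth/kubernetes/config, built from
    Terraform's outputs: the API server host, its CA certificate, and a
    service-account JWT that Vault uses to validate incoming pod tokens."""
    return {
        "kubernetes_host": host,
        "kubernetes_ca_cert": ca_cert,
        "token_reviewer_jwt": reviewer_jwt,
    }

def app_role(service_account: str, k8s_namespace: str, policy: str) -> dict:
    """Payload for auth/kubernetes/role/<name>, binding a service account
    in one cluster namespace to a Vault policy with a short token TTL."""
    return {
        "bound_service_account_names": service_account,
        "bound_service_account_namespaces": k8s_namespace,
        "token_policies": policy,
        "token_ttl": "1h",
    }
```

Once these two calls succeed, a pod running under that service account can log in to Vault with its own token and read only the secrets its policy allows.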
Some things we found out while doing all of this are mostly around monitoring. Yesterday you had the opportunity to hear what a great monitoring tool Prometheus is. And we use the Prometheus, Alertmanager, and Grafana stack to monitor most of our tools and services.
One of the great things is that Vault provides telemetry that Prometheus can consume. We monitor a lot of Vault metrics in Prometheus.
The ones we would recommend you really pay attention to are the number of inactive or unused tokens, which is a good indicator that someone misconfigured something. We had a couple of cases with tens of thousands of inactive tokens that were later traced to an auth method that was not configured properly.
The request-duration metric is used so you can track how your users are experiencing Vault and how fast Vault is for them. A rise in the duration of requests can also indicate problems with networking, with your backend, or elsewhere.
You heard it yesterday. This is a no-brainer. The first alert that you create for Vault should be on the seal status. If your Vault is sealed, it's unusable.
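The check itself is trivial — besides telemetry, Vault exposes seal status on an unauthenticated endpoint. A minimal sketch, assuming you poll `GET /v1/sys/seal-status` and feed the JSON body to a function like this:

```python
import json

def is_sealed(seal_status_body: str) -> bool:
    """Parse the JSON body of GET /v1/sys/seal-status and report whether
    this Vault node is sealed (and therefore unusable)."""
    return bool(json.loads(seal_status_body)["sealed"])
```

In practice the same signal comes out of Vault's Prometheus telemetry, so the alert rule fires without any custom polling — but the logic is exactly this one boolean.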
The per-namespace metrics are powerful, and we use them because we provide namespaces to other teams. They also let us see misconfigurations, or abuse by teams trying to do something they shouldn't.
And here is the thing that I'm personally most proud of. We created some tools to help users use Vault in TomTom.
One of the first tools we created for users is a simple policy generator. We wanted our users to have the best experience when creating policies and when learning how to create them. It's a Python-based solution that generates HashiCorp Vault policies with the required permissions, which developers can then use for the specific policies they need to implement.
The way it works is that you specify a path and all the permissions, and a policy in Terraform or HCL format is generated as the output. The solution can also check for bad practices in policies, like using all the default permissions, using sudo, and many more.
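A minimal sketch of what such a generator does — the output format and the lint rules here are illustrative, not our internal tool:

```python
def generate_policy(path: str, capabilities: list) -> str:
    """Render a Vault ACL policy stanza in HCL for a single path."""
    caps = ", ".join(f'"{c}"' for c in capabilities)
    return f'path "{path}" {{\n  capabilities = [{caps}]\n}}\n'

def lint_policy(path: str, capabilities: list) -> list:
    """Flag patterns we discourage in team policies (illustrative checks)."""
    warnings = []
    if path.strip() == "*":
        warnings.append("policy grants access to every path")
    if "sudo" in capabilities:
        warnings.append("sudo targets root-protected paths and rarely belongs in team policies")
    return warnings
```

For example, `generate_policy("secret/data/team-a/*", ["read", "list"])` yields a clean read-only stanza, while `lint_policy("*", ["sudo"])` flags both of the bad practices mentioned above.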
The second tool we created answers the question: how can we track the expiration of the secrets in Vault? Like any other company, we had some issues with secrets expiring. We are using Azure; Azure has service accounts called service principals; service principals have secrets; and those secrets can expire — and we were storing them in Vault. We also had other secrets that expire, so we needed one solution to tackle secrets expiry.
As I said, we are leveraging Prometheus on a large scale in our team. So what we did was straightforward: we created a Prometheus exporter for monitoring secrets expiry.
How does it work? It's a simple thing. It uses an AppRole to go through configured secrets. It gets the secret expiration from the metadata, and then it provides this data for Prometheus to scrape. With this, we got an excellent way of getting notified when secrets need rotating or the secret is about to expire. And you can also create simple Grafana dashboards to track them all in one place.
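The core of the exporter is just a time calculation. A sketch, assuming the expiration date is stored as an ISO-8601 timestamp under a hypothetical `expiration_date` key in the secret's metadata (the real metadata schema is documented in the exporter's repo):

```python
from datetime import datetime, timezone

def seconds_until_expiry(metadata: dict, now: datetime) -> float:
    """Given a secret's metadata, return how many seconds remain until it
    expires. A negative result means the secret has already expired."""
    expiry = datetime.fromisoformat(metadata["expiration_date"])
    return (expiry - now).total_seconds()
```

The exporter walks the configured secrets after an AppRole login, computes this value for each, and exposes it for Prometheus to scrape, so alert rules like "less than 14 days remaining" become ordinary threshold alerts.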
We have also open sourced this tool. It's available to you in the tomtom-international/vault-assessment-prometheus-exporter GitHub repo — I will show the repo at the end as well. And while I was giving this presentation, we rolled out version 1.0 today.
Secrets management is a must for any company. If any of you work for a company that doesn't have any secrets, credentials, or certificates, I would like to talk with you after this presentation, because that would be really interesting.
Manage your expectations when you are enrolling your solutions because users will need help, and you will need to provide some best practices for them to embrace your solution. Also, try to create tools to help users — to help them address all the issues that they are having.
Try to be data-driven. We are monitoring everything, from user adoption of our solution to the most-used engines and most-used secrets — all the infrastructure we have that is connected to Vault.
Try to connect. This is, let's say, more global advice. At one point we tried to find companies that were using Vault Enterprise as well. We tried to connect with them to see what their problems were, how they built their own solution.
You can always learn from other people’s mistakes, and you can also find some good things from different companies about how the solution can be used and enrolled on a totally different level than maybe you have.
And, of course, try our vault-assessment-prometheus-exporter. If you find it interesting, please star it. And that's it. Thank you. Feel free to approach me after the presentation. I would like to thank my Vault team, Eugene, David, Mateusz, and Michal, for getting me here.