Learn how G-Research leverages HashiCorp Vault environments to secure the self-service GitOps delivery of 1000+ Vault namespaces using Jenkins, Kubernetes, and HashiCorp Terraform.
Speakers: Morgan Atkins, Peter Barta
Hi, everyone. Thanks for joining us. Today we'll be talking about Vault and Vault namespaces. My name is Peter, and I'm joined by my friend and colleague Morgan.
To start our presentation, I'll give you a brief introduction to G-Research, the IAM team, and our objectives, and then catch you up on a brief history of Vault at GR.
We'll then begin covering the main points of our presentation, talking first about the initial efforts to implement and productionize Vault — and where they fell short. Next, Morgan is going to take you through how we settled on Terraform and templating as the keys to overhauling how Vault is used at GR. We'll then share what Vault Namespaces looks like today and wrap up with how we've encouraged the adoption and usage of our new model.
To kick things off, a bit of background about what GR is. G-Research is Europe's leading quant research and technology firm. We analyze datasets in an attempt to predict what the market is going to do according to that data. Fortunately, we're not too bad at predicting that, and we're able to come to you and talk today.
Over the last few years, we've been transitioning to private cloud to support these research efforts and the applications that enable that. We, and the IAM team, have been supporting these efforts by securing the applications that are moving to this private cloud environment, using our new secrets management platform, Vault.
So to rewind the clock and take you back to the start of this private cloud environment, 2018 is when it all started, and we selected Vault as our one-stop-shop for our secrets management needs in our new private cloud environment.
Through the rest of 2018 and into early 2019, we began the productionization of Vault and started to get comfortable with saying it was ready for use. However, we were still finding our way with the usage patterns and had some operational overheads with our configuration and infrastructure as code approaches.
In 2020, our early missteps in the productionization came back to bite us in the form of operational burden and friction for both the customer and IAM. We had hundreds of namespaces, and teams across the business were using Vault, but it was more and more painful to make even the simplest of changes. This takes us to where we are today — nearing the completion of an overhaul of how Vault is used at GR, we're finally at a stage where we're seeing the benefits of our efforts, and we can reflect on a journey and share with you what we've accomplished.
Looking back at 2018 and 2019, I mentioned there were some initial oversights in how Vault was being managed at GR — and the main culprit was a dependency on Ansible. Deploying Vault and Consul with Ansible is great. But the rapidly growing number of teams dependent on Vault for securing their applications were starting to highlight the constraints of having a monolithic repository being deployed by Ansible.
Ansible wasn't the only problem. We required two approvals from IAM for every change, with checks that were taking upwards of 30 minutes. We were seeing longer and longer deployment times, upwards of two hours at one point, causing huge friction and a terrible customer experience. We reached the tipping point. Our monolithic repository was nearing 23,000 lines of code, and it was painful to use.
Given those challenges, a decision was made to begin a body of work whose goal was ultimately to correct the previous design and implementation missteps. The main focus for this was reducing operational burden on the IAM team and standardizing the infrastructure as code approaches used when consuming Vault.
The requirements were to reduce the deployment times and to reduce the dependency on the IAM team for things like pull requests into our GitHub code repository and the general support tasks required when customers were using Vault.
Next was to reduce the customer overhead, making it easier for them to configure and manage their Vault configuration and lowering the bar of entry to allow more users to consume Vault. Lastly, federating the configuration ownership. Giving teams the power to build what they need, how they need to build it — and breaking down the monolith to smaller units so that it can be more easily tested and deployed.
So how did we do this? Well, firstly, we began by asking the customers what they needed. We ran a four-week discovery phase. Because it was early 2020, we were able to do some of these face-to-face — if anyone remembers how to do that. We sat down with all of the application owners inside GR that were using Vault in a meaningful way. We were able to gather all of the key requirements for their applications and distill those requirements. Eventually, we came up with an operational standard.
We took that operational standard, and we codified it in the form of Terraform modules, allowing for the easy consumption of generic Vault services like KV secrets engines, OIDC auth methods, and database configuration.
For each of the key requirements gathered from the application owners, we were able to produce a custom Terraform module. This allowed for a pick-and-mix style approach to creating your namespace. This enabled customers to pull in additional functionality after their initial creation of their namespace as their use cases and requirements changed over time.
For instance, when we first deployed our Vault namespaces, the standard auth method was LDAP. As 2020 rolled on, the security requirements changed, and we needed to move all of the teams that had namespaces inside Vault to OIDC. Because all of our net new namespace teams were using our module method, we were able to hot-swap the LDAP custom module and put in the OIDC module instead, with effectively zero downtime and disruption to their day-to-day work.
Each of the custom modules was also versioned independently of the others. This allowed for feature toggling and blue-green deployments as we enabled more functionality over time.
Each namespace given to a team acted as a metamodule, pulling in the custom modules as and when they needed them. Finally, supporting all of that, we implemented guardrails that would allow 90% of our customers to consume Vault with little to no interaction from the IAM team. Most notably base policies. These base policies restricted some functionality within Vault, providing smooth user experiences while maintaining flexibility for future use cases. We'll talk a little bit more about that later.
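The pick-and-mix composition described above can be sketched roughly as follows. This is only an illustration, not G-Research's actual code: the module catalogue, the placeholder git URLs, and the `namespace` input variable are all assumptions. It emits Terraform's JSON syntax, where a git-sourced module is pinned to a tag with `?ref=`, so each module can be versioned on its own and one auth module can be swapped for another.

```python
import json

# Hypothetical module catalogue; names and URLs are placeholders, not from the talk.
MODULE_SOURCES = {
    "kv": "git::https://git.example.invalid/vault-modules/kv.git",
    "ldap": "git::https://git.example.invalid/vault-modules/ldap.git",
    "oidc": "git::https://git.example.invalid/vault-modules/oidc.git",
    "database": "git::https://git.example.invalid/vault-modules/database.git",
}


def render_namespace(namespace, modules):
    """Build a Terraform JSON document for one namespace.

    `modules` maps a catalogue name to the git tag to pin, so each
    module is versioned independently of the others.
    """
    blocks = {}
    for name in sorted(modules):
        if name not in MODULE_SOURCES:
            raise ValueError(f"unknown module: {name}")
        blocks[f"{namespace}_{name}"] = {
            # "?ref=" pins a git-sourced Terraform module to a tag.
            "source": f"{MODULE_SOURCES[name]}?ref={modules[name]}",
            # Hypothetical input variable that the module would accept.
            "namespace": namespace,
        }
    return {"module": blocks}


def swap_module(modules, old, new, version):
    """Return a copy of `modules` with `old` replaced by `new` at `version`."""
    updated = {k: v for k, v in modules.items() if k != old}
    updated[new] = version
    return updated


if __name__ == "__main__":
    selection = {"kv": "v1.0.0", "ldap": "v1.2.0"}
    selection = swap_module(selection, "ldap", "oidc", "v2.0.0")
    print(json.dumps(render_namespace("team-a", selection), indent=2))
```

Treating the namespace as a plain mapping of module name to version keeps a change such as LDAP to OIDC down to a small edit in one file.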
If a team decides to provision a namespace using our model, they receive the following authentication methods. The authentication types we support are based on the requirements gathered in the initial discovery.
These are things like LDAP — as we've already discussed — OIDC, Kubernetes integration, service accounts, and AppRoles. The ACL guardrails — that we've already discussed — prevented misconfiguration and misuse when users were self-serving the namespaces. One challenge when we initially designed the solution was we didn't want to have infinite child namespaces created inside our Vault namespaces. This is where the initial guardrail and base policies were born. We needed a way of preventing users from creating resources that would put a strain on our Vault infrastructure.
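As a rough illustration of the child-namespace guardrail, the sketch below renders a Vault policy that denies access under `sys/namespaces/*`, the Vault Enterprise path used to manage namespaces. The talk does not show the real base policies, so the rule list and the helper name `render_policy` are assumptions.

```python
def render_policy(rules):
    """Render (path, capabilities) pairs as Vault policy HCL text."""
    blocks = []
    for path, capabilities in rules:
        caps = ", ".join(f'"{c}"' for c in capabilities)
        blocks.append(f'path "{path}" {{\n  capabilities = [{caps}]\n}}')
    return "\n\n".join(blocks) + "\n"


# Assumed guardrail: deny everything under the namespace-management API,
# so a team cannot create child namespaces inside its own namespace.
BASE_GUARDRAILS = [
    ("sys/namespaces/*", ["deny"]),
]

if __name__ == "__main__":
    print(render_policy(BASE_GUARDRAILS))
```

In Vault, `deny` takes precedence over any other capability granted on the same path, which is what makes it suitable for a base policy attached to every user.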
In addition to the obvious KV V2 secret engine and database secret engine, we also created two additional secret engines. First, GR metadata. This endpoint stores owner information, application ID, data integrity, and classification status.
This allows us to use an external tool called Vault SAP to periodically query all of the namespaces inside Vault to understand who owns them, what they're used for, and what application they are serving.
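A periodic ownership sweep of this kind might look like the sketch below. It is not the tool mentioned in the talk: the field names are guessed from the description of the metadata endpoint, and `read_metadata` is a hypothetical callable standing in for whatever client actually reads each namespace.

```python
# Field names assumed from the talk's description of the metadata endpoint.
REQUIRED_FIELDS = ("owner", "application_id", "data_integrity", "classification")


def collect_metadata(namespaces, read_metadata):
    """Query every namespace and report which ones lack required metadata.

    `read_metadata(namespace)` is a hypothetical callable returning a dict
    (or None if nothing is stored). Returns (inventory, incomplete), where
    `incomplete` maps a namespace to its list of missing fields.
    """
    inventory = {}
    incomplete = {}
    for ns in namespaces:
        record = read_metadata(ns) or {}
        inventory[ns] = record
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            incomplete[ns] = missing
    return inventory, incomplete


if __name__ == "__main__":
    # Fake data in place of real Vault reads.
    fake_store = {
        "team-a": {
            "owner": "alice",
            "application_id": "app-1",
            "data_integrity": "high",
            "classification": "internal",
        },
        "team-b": {"owner": "bob"},
    }
    inventory, incomplete = collect_metadata(["team-a", "team-b"], fake_store.get)
    print(incomplete)
```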
Lastly, we created the secret deposit. Because the initial implementation of Vault had everybody in the root namespace, it became a culture of cross-secret-engine-communication. Secrets were being shared inside Vault between teams, which flies in the face of what Vault should be used for.
As the IAM team, we understood that these habits would take a long time to unlearn, so we provided teams with the ability to airdrop secrets from one namespace to another. If the database team, for instance, were to write into the secret deposit path, they would not be able to read that secret back. Only the owner of that namespace would be able to read it.
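The access rule behind the secret deposit can be modeled in a few lines. This is an in-memory sketch of the semantics only. In Vault the rule would be enforced through policies, which the talk does not detail, and the class and method names here are hypothetical.

```python
class SecretDeposit:
    """Write-only drop box: anyone may deposit, only the owner may read."""

    def __init__(self, owner):
        self.owner = owner
        self._secrets = {}

    def deposit(self, caller, key, value):
        # Any caller may write; the value is never returned to the writer.
        self._secrets[key] = value

    def read(self, caller, key):
        # Only the namespace owner may read a deposited secret back.
        if caller != self.owner:
            raise PermissionError(f"{caller} may not read from the deposit")
        return self._secrets[key]


if __name__ == "__main__":
    box = SecretDeposit(owner="app-team")
    box.deposit("database-team", "db-password", "example-value")
    print(box.read("app-team", "db-password"))
```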
Reflecting on a horrible configuration repository, the long checks, and longer deployment times, refining the customer experience was paramount. We looked to revamp the process from the moment an application or service engineer needed to start using Vault.
To get started, users would clone our Vault Terraform namespaces repository and work through an initial script, prompting them for things like the Vault namespace name, its classification, and what application it would belong to.
Next, the user would submit a pull request to the repository, which would be reviewed by one member of the IAM team — and only once. Fortunately, from this step on, IAM will no longer be involved. The resource owner defined at the creation step would be prompted for future approvals for any configuration changes.
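The approval flow just described reduces to one decision: a brand-new namespace goes to IAM once, and every later change goes to the resource owner recorded at creation. A minimal sketch, assuming a hypothetical reviewer group name and a simple mapping of namespace to owner:

```python
IAM_TEAM = "iam-team"  # hypothetical reviewer group name


def required_approver(namespace, existing_owners):
    """Pick who must approve a pull request that touches `namespace`.

    `existing_owners` maps already-created namespaces to the resource
    owner defined at creation time.
    """
    if namespace not in existing_owners:
        # First pull request for a namespace: one review from IAM.
        return IAM_TEAM
    # Any later change: the resource owner approves, not IAM.
    return existing_owners[namespace]


if __name__ == "__main__":
    owners = {"team-a": "alice"}
    print(required_approver("team-b", owners))
    print(required_approver("team-a", owners))
```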
Next, we went on to testing and deploying the pull request. We overhauled this process by beginning with the testing. This was done in an ephemeral container to ensure any code deployed against Vault would run fine and not affect our production deployment.
Once this was confirmed, and we were happy with what the result was, Terraform would kick in with a plan and apply. Users would end up with their Vault namespace in a few minutes once the code was merged. We had much happier customers.
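The test-then-deploy sequence can be sketched as an ordered list of stages that stops at the first failure. The Terraform commands are standard CLI invocations. The first stage is a hypothetical helper script standing in for the ephemeral-container test, since the talk does not say how that step is implemented.

```python
import subprocess

STAGES = [
    # Hypothetical script: start a throwaway Vault and apply the change to it.
    ("test", ["./test-in-ephemeral-vault.sh"]),
    ("init", ["terraform", "init", "-input=false"]),
    ("plan", ["terraform", "plan", "-input=false", "-out=tfplan"]),
    ("apply", ["terraform", "apply", "-input=false", "tfplan"]),
]


def run_command(command):
    """Run one command and return its exit code."""
    return subprocess.run(command).returncode


def run_pipeline(stages=STAGES, run=run_command):
    """Run stages in order and stop at the first non-zero exit code.

    Returns (completed_stage_names, failed_stage_name_or_None).
    """
    completed = []
    for name, command in stages:
        if run(command) != 0:
            return completed, name
        completed.append(name)
    return completed, None
```

Passing the runner in as a parameter lets the ordering logic be exercised with a fake runner, without touching a real Vault or Terraform state.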
Despite investing all this time in developing this new solution — that was the best thing since sliced bread — we still were hearing a lot of negative feedback from users and stakeholders across the business. Understandably, we found developers were hesitant to accept the need to redo how they were managing the secrets for applications that they had already configured.
Managers were unhappy with the prospect of their resources needing to commit their time to fix something that they didn't believe was on them to have to fix in the first place. Even within the IAM team, we were hearing sentiments around the problem that we were going to have another configuration repository to manage — more approvals, and more things to support when things would eventually break.
We needed a strategy and a message to share with the business to highlight the importance of moving to this new model and facilitate that adoption as best we possibly could.
Once the technical part was finished, it was time for us as the IAM team to have a battle of hearts and minds.
Migrating hundreds of namespaces isn't easy, so we began the first step with hand-holding early adopters. This meant that the first handful of teams that were joining us on our journey would get the first-class service. They would get early access to the features that we were generating in return for their invaluable feedback about the process, the documentation, and any features that they felt were lacking or needed improvement.
Once we finished with stage one, we opened up to the rest of the business. We ensured there were dedicated channels for support using Slack, email, and side-of-desk when people were in the office. This ensured anybody onboarding on to the new process had immediate access to the IAM team and the people who created the solution.
Once we had established an operating model for how to deliver namespaces into the organization effectively, we announced that we were turning off the creation of namespaces through the old Ansible model.
This had adverse effects because more and more people were learning that their method of managing their namespace was going away, and they had to learn a new process. Luckily, word of mouth grew that the new process was more streamlined, faster, and more secure, and that adopting it was the right thing to do. More people were coming to us to request access to the new model to deploy their namespaces.
Then, we had to tackle the namespaces that were already configured, not just the net new ones. We, as the IAM team, migrated over 100 traditionally managed namespaces from Ansible into the new model because they were crying out for a faster, more effective way to deploy their namespaces. This obviously had other challenges around changing the way that they interacted with Vault and their applications. But using our support channels, we were able to give them a first-class service. As for decommissioning the old model, we're working towards turning off Ansible and its features in their entirety.
As Morgan has alluded to, we're not quite at the finish line. We're finally at a stage, though, where we can reflect on our journey and share what we've learned with you today.
Earlier this year, we stopped allowing any namespaces to be created using the old model, and we stopped deploying most of the configuration from Ansible as well. We're at a stage where we're broadening the feature set available within namespaces, and we're expanding on the nice-to-haves that encourage people to migrate.
Later this year, we're hoping to move full steam towards decommissioning the remainder of the Ansible base configuration deployments. Hopefully, we'll see a major milestone hit and exceeded, which reflects a lot of work of the past 18 months for us.
We look forward to looking beyond Q4 and to what our roadmap offers: features like tighter integration with other services and potentially more custom plugins, and the chance for us to finally start focusing on developing more things for Vault now that our customers are happy and enjoying what they're using.
In closing, we'd like to share a few final thoughts, and hopefully you can go away and apply these to your own situations:
Firstly, it's critical to ensure that you don't pick what's right and easy for the time that you're in right now. You've got to ensure that you've got the right tools that meet your business needs for now and well into the future, having understood your requirements.
Next, it's important to understand that the operating model and the stakeholders' requirements must be tackled together. There's no point designing a solution in a vacuum to later find out that the requirements were never met or have changed in the meantime.
To expand on that, the adoption cannot be done separately from the solution. Engagement with teams across the business has been critical in the take-up of our model. In fact, it's been important for us to get to where we are today with the support from other teams — and I don't think we could have managed this without having close ties to the teams and functions across the organization that represent using secrets within applications.
Finally, success is not just dependent on the technology. As I hope we've demonstrated today, it's important to look at the people and the processes as well — and give them the time that they deserve when undertaking a migration such as this. I'd like to say thank you very much for listening today. I've been Morgan Atkins.
Have a wonderful evening.