Case Study

Automating Multi-Cloud, Multi-Region Vault for Teams and Landing Zones

Learn how to deploy a multi-tenant Vault Enterprise environment using a combination of Terraform, Vault, and Packer to enable Vault namespace self-service.

In this session, learn about automating Vault deployment and namespace creation for individual teams and Terraform landing zones to provide Vault as a Service to the organization.

»Transcript

 

Bryan Krausen:

Hey, everyone. Really wanted to be excited enough to run up here like Steve Ballmer introducing Windows 95, but haven't had that much coffee yet. Thanks for coming to our session, “Automating Multi-Cloud, Multi-Region Vault for Teams and Landing Zones.” A lot of buzzwords in there. We'll talk about what we're going to talk about soon. I'm Bryan, this is Josh. Quick introduction — Josh, want to introduce yourself?

 

Josh Poisson:

Josh Poisson, cloud engineer on the platform infrastructure team. We're responsible for — our team is responsible for managing resources in AWS, Azure, and GCP, hence why HashiCorp's a great fit for us there. I'm currently more focused on managing all the tools that we have — our DevOps tools, Terraform Cloud, Vault Enterprise, Consul Enterprise, Packer as well.

 

Bryan:

I'm Bryan Krausen. I'm a principal consultant, been working with Vault for way too long — six years now probably — specializing in HashiCorp tools and AWS, mostly Vault, Consul, Packer, Terraform, a lot of them. I'm also an instructor; the company I work for, we are HashiCorp's official training partner. I've taught a lot of people a lot of these different tools. I have courses on my own where I've taught over 40,000 students HashiCorp products. Super happy to be here. This is my fourth HashiConf in a row presenting. Doesn't get easier, but I’m always excited to be here.

Talking about buzzwords in our title, what are we actually talking about today? You're going to get a presentation from both Josh and me. The relationship between Josh and me: Josh works for Company X, and I was a consultant. We came in because Josh and his organization needed to deploy Vault to satisfy a lot of the requirements that we're going to talk about today. Basically I'm going to be talking about the stuff on the left: I'm going to introduce the business use cases, why Vault, and then we're going to go into the deployment piece. So, why Vault, what challenges we really needed to solve within the organization, and then we'll go into the technical piece of how Vault was actually deployed in the organization. After that, I'm going to hand it over to Josh.

Josh is going to talk about the requirements — more or less the requirements from the development teams and the different business units within the organization — and how we're going to configure Vault once it's up and running. He's also going to talk about the automation piece, and this is where the namespace automation comes in: configuring Vault for consumption by different business units and developers, talking with those teams, identifying the needs of each business unit — everyone has a different reason for wanting to use Vault depending on their application — and, of course, automating the namespaces. That's the whole landing zone piece: when Josh gets a requirement from a developer or a business unit, they can lay down the infrastructure for that business unit, the infrastructure to run the application. And on top of that service they're providing, we're now also providing Vault capabilities. That's what we're going to be talking about today.

»Why Vault?

Why did Josh's organization really need to deploy Vault? Those are the things we're going to be talking about here. They had a need for isolated environments for all these different business units and development teams. Each of these development teams had unique requirements for the organization. They all ran different applications. Some people may be using AWS, some people may be using Azure static keys. Maybe Josh's team needed dynamic credentials to be generated for things like Terraform Enterprise or CI/CD pipelines, things like that. A lot of different business units within the organization had a lot of different requirements on how they wanted to use Vault.

Uniformity and standardization. I think every organization wants to go down that path. I put "drive efficiencies and standardization" — even though we have all these different use cases, we still want to standardize across each one. Some of the stuff that Josh is going to talk about, like automating the namespaces — for instance, every time a namespace gets stood up, we're stamping out the same KV structure and policies — that's your standardization across the board. It just makes it easier for Josh's team to manage Day 2 ops for these teams.

Josh's organization had a cloud-first strategy, like many organizations have today. They currently use AWS and Azure because they have applications running in both, so obviously we need to deploy Vault in both. We want to follow that cloud-first strategy, so that's where Vault is deployed, and we'll talk about that as well.

Centralization of security. Vault is a security tool. We also need to audit what people are doing inside of Vault. So, talking about auditing for assurance across the organization. And a little funny story about the audit stuff that we'll talk about as well.

Managing secret sprawl. Pretty much every organization that deploys Vault, at least from my perspective, that's the number one goal. I talked about this in the hallway track yesterday, but most organizations have secrets in Kubernetes and Jenkins and Puppet and Chef and all that kind of stuff. We want to centralize all those secrets into a centralized solution. That way Vault becomes the central solution for identity and credentials across the organization.

Finally, one of the big goals is to make sure that we can manage this — not only deploy it, but manage it from a Day 2 operational perspective — through code. We deployed everything through code, but we want to make sure that Josh and his team can easily manage the solution moving forward through code as well — submitting PRs, making changes. We're going to talk about Packer and Terraform and all that good stuff.

»Open Source To Enterprise

With all that said, all the requirements identified, Josh's team deployed Vault Open Source successfully. This is before I came into the picture. Open Source was successfully deployed. We know it was successful because a lot of the developers and business units started consuming Vault. They started becoming reliant on the Vault platform.

Now because of this, Josh's team realized, “Okay, now a lot of our applications are highly dependent on Vault, but we're just running Open Source. Obviously we have redundancy within a cluster, but we don't necessarily have redundancy across clouds. We don't have replication, all the fun stuff that Vault Enterprise gets us.” So because of that, Josh and his team reached out to Hashi, they bought Vault Enterprise, and this is the stage where my team came in to help deploy this highly available Vault infrastructure.

Any time I do a Vault implementation, it's always critical to determine the hard requirements for that deployment. At every organization I've worked with, the requirements are slightly different. There's always commonality in terms of, "We need high availability, we need replication because we're cross-cloud." Some organizations have an active-passive setup, some have active-active. In Josh's case, we have active-active. I always design Vault to follow the application architecture. If you have an active-passive setup in your organization, where you have a primary data center and you fail over to a secondary data center, that's typically how I would design and deploy Vault as well, so we don't have to pay for licensing over here and we don't have to worry about deploying additional clusters.

In this case we have an active-active, and across multiple clouds as well — not only multiple clouds, but multiple regions within each one of those clouds. Hence, the title of our talk here.

As I mentioned, we wanted to manage Vault using infrastructure as code and more of a GitOps workflow. Any time we wanted to make changes, we simply submit a PR for changes against Packer or Terraform or those kind of things, and then we will be able to manage the solution that way.

Including additional tooling: I mentioned the audit logs before. Obviously Vault, again, is a security tool. We want to make sure that we're auditing everything that's happening within there. Not only that, but being able to visualize that stuff in something like Datadog, which Josh's organization uses, and then be able to capture telemetry metrics in there as well. And then, finally, configuring Vault Enterprise across all of our clusters. We have a very large replica set, as you'll see next.

»Multi-Cloud, Multi-Region Architecture

What is the architecture that we ended up deploying? Now, this is a somewhat abbreviated architecture. We actually have both AWS and Azure, and we have lots of clusters — this is only half the clusters that were deployed. We deployed 16 total clusters across the board. The requirement was, again, multi-cloud, so AWS and Azure, and multi-region: in AWS we actually used multiple regions in the US and multiple regions in EMEA as well. So if you take this and roughly double it, you have AWS with two regions in the US and two regions in EMEA, and then the same thing for Azure.

The ones across the top there represent the production clusters, and each one of those production clusters also had its own DR cluster within that region. That way we could support a failover in the event that the primary cluster went down. So we deployed a lot of nodes for that. I didn't mention that each one of these clusters was a five-node cluster. And we also deployed it... When was this? Probably a year or two ago at least. We deployed this on Consul, using a Consul storage backend. You can imagine the number of nodes that we had to spin up to support this solution overall.

How did we actually do this? Once we gathered the requirements and all that good stuff, how did Josh and I work together? I just threw this slide in there because it's no secret, the last couple years have been tough with COVID and all that stuff. And of course, that's when Josh and I started working together on this project. Josh lives pretty far away from me, so we ended up just doing through tons and tons of Zoom meetings, I felt like — lots of working sessions, multiple working sessions each week. And of course, I didn't have access into their environment. Mostly my contributions were either shoulder-surfing Josh's desktop, telling Josh to give me control over his desktop, or just submitting like PRs through GitHub and committing directly into their GitHub. While some people were out there adopting all their pandemic puppies, we were out there adopting Vault Enterprise for Josh's organization here.

How did we do it? Getting into the technical piece of it: we start with Packer. We obviously need a consistent image, especially given that we're using both AWS and Azure to deploy Vault. We want to make sure that we have the same version of Vault, the same patching, all that good stuff. Josh and I didn't do a great job of planning here — we kept having to go back to Packer: "Oh man, now we've got to add the Datadog agent. Oh man, now we've got to go add this thing." That's one of our lessons learned that we'll talk about later. But we start with Packer — creating the Packer build, running `packer build`, pulling in artifacts — and ultimately creating our consistent images for all of our AWS and Azure regions so we have consistency across the board. The good thing about this is that with Packer, all we have to do to update is update the Packer build, create new images, and push those across. Then we can just update our Terraform to say, "Go pull the latest build," push that up, and easily upgrade our Vault clusters moving forward.
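
For a sense of what that looks like, here's a minimal Packer HCL sketch of an AWS image build — this is an illustration under assumptions, not the actual build Josh's team uses; the plugin version, base AMI filter, and provisioning scripts are placeholders, and an equivalent `azure-arm` source would sit alongside the AWS one:

```hcl
packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = ">= 1.2.0"
    }
  }
}

# Hypothetical AWS source; an azure-arm source would be defined alongside it for Azure.
source "amazon-ebs" "vault" {
  region        = "us-east-1"
  instance_type = "t3.medium"
  ssh_username  = "ubuntu"
  ami_name      = "vault-ent-${formatdate("YYYYMMDDhhmm", timestamp())}"

  source_ami_filter {
    filters = {
      name                = "ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    owners      = ["099720109477"] # Canonical
    most_recent = true
  }
}

build {
  sources = ["source.amazon-ebs.vault"]

  # Install Vault plus the agents we kept going back to add (Datadog, etc.).
  provisioner "shell" {
    scripts = [
      "scripts/install-vault.sh",
      "scripts/install-datadog-agent.sh",
    ]
  }
}
```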

After we got done with the Packer stuff, obviously we want to use Terraform. We're not going to deploy 16 different clusters plus 16 Consul clusters by hand here. Obviously we use Terraform to deploy everything across both clouds. Basically we created a module, a centralized module for each different cloud. And then we had tons of calling modules for each one of those. It looked something like this. We had our module over here on the left — all of our Terraform Cloud workspaces, this is all using Terraform Cloud — all these calling modules in the middle here calling our primary module, and then ultimately going and deploying that Vault and Consul cluster in the region that we needed to do.

This is obviously duplicated for Azure as well, but it really simplified the deployment process: a single module per cloud that we can update, and it impacts everything for that cloud. That's a super easy way to do it. Then our calling modules really just had a bunch of variables in them — that was basically all, the variables. Because in AWS, each deployment has its own VPC and subnets and those kinds of things we have to identify. So the calling modules are really easy: just passing in the variable values we wanted for each one.
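
To make that concrete, here's a hedged sketch of what one of those thin calling modules could look like — the private-registry path, variable names, and values are hypothetical, not the team's actual module interface:

```hcl
# One Terraform Cloud workspace per cluster/region, each just calling the shared module.
module "vault_aws_us_east_1" {
  source  = "app.terraform.io/example-org/vault-cluster/aws" # hypothetical registry path
  version = "~> 1.0"

  region             = "us-east-1"
  environment        = "production"
  vpc_id             = var.vpc_id
  private_subnet_ids = var.private_subnet_ids
  node_count         = 5
  dr_cluster         = false
}
```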

»Lessons Learned

Now that Vault is deployed using Terraform and all that good stuff, what are some of the lessons learned? It wasn't perfect — I've deployed Vault a lot of times, and it's never perfect. A few things to talk about: Azure and AWS are obviously very different platforms, so we struggled a little bit determining the best approach, or the best code to use, for AWS versus Azure. They have different services that would support the Vault deployment. For example, in AWS we wanted to use Auto Scaling to make sure Vault nodes are up and running, and also to ensure we can easily upgrade those clusters when we want to. After updating Packer and all that good stuff, we would just go to the Auto Scaling group, shut down or terminate a node, and the Auto Scaling group would automatically bring one up using the newer image. So it's pretty easy in terms of a Day 2 operation and being able to upgrade clusters.
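
A hedged sketch of that Auto Scaling pattern — names, sizes, and the image naming convention are placeholders — where a terminated node is replaced from the newest Packer-built AMI:

```hcl
# Always resolve the newest image produced by the Packer build.
data "aws_ami" "vault" {
  owners      = ["self"]
  most_recent = true

  filter {
    name   = "name"
    values = ["vault-ent-*"] # hypothetical AMI naming convention
  }
}

resource "aws_launch_template" "vault" {
  name_prefix   = "vault-"
  image_id      = data.aws_ami.vault.id
  instance_type = "m5.large"
}

resource "aws_autoscaling_group" "vault" {
  name                = "vault-cluster"
  min_size            = 5
  max_size            = 5
  desired_capacity    = 5
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.vault.id
    version = "$Latest"
  }
}
```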

TLS certificates are not fun. They never are. Josh and I struggled so much. After Josh would mint the certificate — getting it from the solution that they use — just getting the formatting correct is what we struggled with the most. We would deploy a Vault node, and then Vault would not come up. We'd look at the logs, and it didn't like the cert: there were extra spaces in there, it wasn't formatted correctly. Oh my gosh, we fought with that so much. But we finally found a formula that worked, including pulling it onto Josh's machine and deleting white space. It was not fun, but we got it done.

We did have a permission issue across clusters for logging. This was an interesting one. When you enable audit logging, the Vault user that runs the Vault service has to have permission to write to the log. That's something we forgot in the Packer build or the Terraform script — creating the directory and setting ownership for that Vault user.

We ran into this where Josh calls me one day during our working session and says, "Everything's down." And I'm like, "What do you mean everything's down? Everything was working great yesterday." He's like, "I enabled audit logging and now all the replication is broken." What happened was, we enabled audit logging on the primary cluster, and if you use replication and don't add that little `-local` at the end, that configuration gets replicated to all clusters. So all the clusters all of a sudden said, "Hey, I need to enable audit logging and I need to write to this specific directory." Well, we hadn't set up permissions for that directory on all the clusters. So replication was broken, because it couldn't actually apply that setting across the board. We had to go fix that. We updated Packer and Terraform and redeployed everything. It was fine, but it was a big mystery for a little while why Vault was down.
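
If you manage that audit device through the Vault Terraform provider instead of the CLI, the fix looks roughly like this — a hedged sketch assuming the `vault_audit` resource's `local` flag, with a placeholder log path:

```hcl
# Enable the file audit device as a cluster-local mount so the configuration is not
# replicated to secondaries that may not have the directory or permissions in place yet.
resource "vault_audit" "file" {
  type  = "file"
  local = true # equivalent of `vault audit enable -local file ...`

  options = {
    file_path = "/var/log/vault/audit.log" # directory must exist and be owned by the vault user
  }
}
```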

»Post-Deployment Issues

The last part is just the combination of all those things I mentioned. We didn't do a great job of planning. I think Josh and I are like, "Yeah, let's just go. Let's get things up and running." But as we got things up and running, as I mentioned, we had to go back and like, "Okay, well, we've got these permissions for audit logging. We forgot these permissions or forgot to add the Datadog agent or configuration file as part of the deployments." So, we did have to go back a couple times and add some stuff. I think Josh had to do that even after I left — he modified some things. But overall, at this point we have Vault clusters up and running across multiple clouds.

This is where Josh will take over and showcase what was done within the environment after Vault was actually up and running. Now we're starting to set things up for consumers to consume the Vault service.

 

Josh:

Thank you, Bryan. Taking over here, where we are now in the process: Vault, like you said, is ready to go — let's start consuming it. The flood of people wanting to get their own namespaces: how do we manage that? What we did was think about creating a module for this to take care of the deployment process, as you would with Terraform. We're going to use Terraform Cloud the whole way through here. We want to make sure that when we deploy namespaces, we can deploy them very quickly for folks so teams aren't left waiting. And we need to automate the process, obviously, as you mentioned, because we don't want a Vault admin spending time doing boring tasks.

We also recognized that in the future there are likely to be project lifecycles within the organization where a short-term project needs a more isolated environment. We wanted that to influence the design of the module we're creating, so we can leverage the Terraform Landing Zone we built separately and let it create namespaces if needed as well. That way we can manage the whole lifecycle of a project: once it's done, the Landing Zone environment gets terminated, and the Vault namespace gets ripped away along with it. We don't have to worry about chasing that around and cleaning it up. So, we have that use case. Then we also think about multi-cloud, as we've mentioned many times already — we're on AWS, Azure, and GCP.

So we want to have a dedicated namespace for our Landing Zone, to enable the Landing Zone to deploy all the infrastructure for those various cloud environments with one central reference point that helps the workflow proceed.

»Implementing Namespaces

One thing to talk about, too, is getting into Vault namespaces. This was our first time — as Bryan mentioned, we were coming from open source — so we had to think a little bit differently and understand how you get humans authenticated using Azure AD for SSO. When we considered how they're going to authenticate in, we went with Azure Active Directory and Vault's OIDC (OpenID Connect) auth method, referencing external groups in Vault. That allows us to continue to leverage our existing Azure AD group structure and grant teams access moving forward.

The way it works here, like I said, was new to us because we hadn't done the namespace piece before. I just wanted to show that diagram to map it in your mind: you go to Azure AD, you grab the Azure AD group's object ID, and you add that to an external group in the root namespace. Then, when you want to create a new namespace, you create the namespace, go back and reference the external group ID that's in root, and make sure you incorporate it into the internal group within the Vault namespace. That maps it all together. Obviously you can do that manually through the UI or CLI or whatever, but we take care of that whole process within the namespace deployment module that we created, so it's much easier.
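
As a hedged Terraform sketch of that mapping — the group names, namespace path, object ID, and OIDC accessor are placeholders, and it assumes a Vault provider version that supports the per-resource `namespace` argument:

```hcl
# External group in the root namespace, aliased to the Azure AD group's object ID.
resource "vault_identity_group" "team_a_external" {
  name = "team-a-azuread"
  type = "external"
}

resource "vault_identity_group_alias" "team_a_external" {
  name           = "00000000-0000-0000-0000-000000000000" # Azure AD group object ID
  mount_accessor = var.oidc_auth_accessor                  # accessor of the OIDC auth method in root
  canonical_id   = vault_identity_group.team_a_external.id
}

# The team's namespace, plus an internal group inside it that includes
# the root-level external group as a member.
resource "vault_namespace" "team_a" {
  path = "team-a"
}

resource "vault_identity_group" "team_a_internal" {
  namespace        = vault_namespace.team_a.path
  name             = "team-a-admins"
  type             = "internal"
  member_group_ids = [vault_identity_group.team_a_external.id]
  policies         = ["team-a-admin"] # namespace-local policy
}
```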

When thinking about how machines are going to authenticate into Vault, we identified a few of the common auth methods teams want to use out of the gate. We add those within our module as well, to easily enable them when needed, and we use this process to also set default paths for each of these auth methods. One we use quite a bit is AppRole. Right now we're using it for the Terraform Cloud workspaces, to give them permission to modify a namespace. We'd like to get away from that, but it's the best solution we could go with right now — quick, easy, get going.
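
A minimal hedged sketch of that AppRole setup for a workspace — namespace, role name, policies, and TTLs are all hypothetical:

```hcl
# AppRole auth method inside the team's namespace (per-resource namespace
# argument assumed to be supported by the provider version in use).
resource "vault_auth_backend" "approle" {
  namespace = "team-a"
  type      = "approle"
  path      = "approle"
}

# Role whose credentials the Terraform Cloud workspace uses to manage the namespace.
resource "vault_approle_auth_backend_role" "tfc_workspace" {
  namespace      = "team-a"
  backend        = vault_auth_backend.approle.path
  role_name      = "tfc-team-a"
  token_policies = ["team-a-admin"]
  token_ttl      = 1800
  token_max_ttl  = 3600
}
```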

Then there's the JWT (JSON Web Token) auth method. We use that with GitHub, to enable teams to access Vault from a self-managed GitHub runner that we have. That way they can pull credentials into their runs without having to worry about storing those credentials in GitHub. Again, that helps us get toward that centralized management of secrets.
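
A hedged sketch of that JWT setup for GitHub's OIDC tokens — the mount path, role name, audience, and bound claims are placeholders for whatever the runners actually present:

```hcl
# JWT auth method trusting GitHub's OIDC token issuer.
resource "vault_jwt_auth_backend" "github_actions" {
  path               = "jwt-github"
  oidc_discovery_url = "https://token.actions.githubusercontent.com"
  bound_issuer       = "https://token.actions.githubusercontent.com"
}

# Role that workflows in a specific repository can log in with.
resource "vault_jwt_auth_backend_role" "ci" {
  backend         = vault_jwt_auth_backend.github_actions.path
  role_name       = "team-a-ci"
  role_type       = "jwt"
  user_claim      = "repository"
  bound_audiences = ["https://github.com/example-org"] # must match the token's aud claim
  bound_claims = {
    repository = "example-org/team-a-app"
  }
  token_policies = ["team-a-ci"]
  token_ttl      = 900
}
```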

Another auth method that gets used quite a bit by our software development teams is Azure auth. They like to use the Azure auth method with managed identities in Azure so their applications can authenticate into Vault and grab the credentials they need.
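
And a hedged sketch of that Azure auth pattern — the tenant, subscription, resource group, and policy names are placeholders:

```hcl
resource "vault_auth_backend" "azure" {
  type = "azure"
  path = "azure"
}

resource "vault_azure_auth_backend_config" "azure" {
  backend   = vault_auth_backend.azure.path
  tenant_id = var.azure_tenant_id
  resource  = "https://management.azure.com/"
}

# Applications running with a managed identity in this resource group can log in.
resource "vault_azure_auth_backend_role" "app" {
  backend                = vault_auth_backend.azure.path
  role                   = "team-a-app"
  bound_subscription_ids = [var.azure_subscription_id]
  bound_resource_groups  = ["rg-team-a-prod"]
  token_policies         = ["team-a-app"]
}
```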

Some other things that aren't mentioned on this slide are the secrets engines we're using. One that we've started to consume more is the database secrets engine, specifically for Snowflake, because we have some teams spinning up a bunch of Snowflake environments. We want to avoid them passing admin credentials around, so we're leveraging that secrets engine to let them create short-lived credentials and move on. That's been helpful.
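
A rough, hedged sketch of what that can look like through the Vault provider — the connection URL, role name, TTLs, and creation statements are placeholders that would need to match your Snowflake account:

```hcl
resource "vault_mount" "db" {
  path = "database"
  type = "database"
}

resource "vault_database_secret_backend_connection" "snowflake" {
  backend       = vault_mount.db.path
  name          = "snowflake-analytics"
  allowed_roles = ["analyst"]

  snowflake {
    connection_url = "{{username}}:{{password}}@example-account.snowflakecomputing.com/EXAMPLE_DB" # placeholder
    username       = var.snowflake_vault_user
    password       = var.snowflake_vault_password
  }
}

resource "vault_database_secret_backend_role" "analyst" {
  backend     = vault_mount.db.path
  name        = "analyst"
  db_name     = vault_database_secret_backend_connection.snowflake.name
  default_ttl = 3600
  max_ttl     = 86400

  # Placeholder statements; the real ones depend on your Snowflake roles.
  creation_statements = [
    "CREATE USER {{name}} PASSWORD = '{{password}}' DEFAULT_ROLE = ANALYST DAYS_TO_EXPIRY = {{expiration}};",
    "GRANT ROLE ANALYST TO USER {{name}};",
  ]
}
```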

This is where the Landing Zone comes in, leveraging that dedicated namespace I mentioned. The Terraform Landing Zone has its own namespace, which is referenced during deployments of cloud accounts. This isn't a complete list of everything we're doing in there, just a few big highlights. One is API keys for tools: any of our monitoring or security tools are standardized across all of our cloud environments, and we want to be able to quickly reference the API keys and pull them in during deployments in that Landing Zone workflow. Another is that it helps with naming conventions. We have things like department names or some of the tagging structure you might want, and we're keeping these naming conventions within a KV. Within the namespace we're using JSON formatting to help with that whole process — things like mapping short name versus long name. For example, for a tier in an environment, whether it's dev or prod, maybe the short name is "dev" and the long name is "development." Within that JSON formatting, we're able to reference our standards for that formatting and incorporate them during deployment processes through the Terraform Landing Zone. It's helpful there in an interesting way.

Another thing is the IP allow list. A lot of the networking that gets deployed through the Terraform Landing Zone is able to reference that and grab all the IP allow lists we have for certain scenarios. We also like that we can include some metadata there as well. Because, again, we're using JSON formatting within the key-value store, we're able to add descriptors like, maybe, "this IP address is related to this office" or "this application." That way, once it's all deployed within the cloud environment, if anyone has to look through the console or any other teams need to reference and understand what these IPs are, there's a note there to help them out.
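
To illustrate the idea, a hedged sketch of that KV-with-JSON pattern in the Landing Zone namespace — the namespace, mount path, keys, and values are made up, and the per-resource `namespace` argument is assumed to be supported:

```hcl
resource "vault_mount" "lz_kv" {
  namespace = "landing-zone"
  path      = "lz-config"
  type      = "kv"
  options   = { version = "2" }
}

# Naming-convention mappings (short name vs. long name) referenced during deployments.
resource "vault_kv_secret_v2" "naming" {
  namespace = "landing-zone"
  mount     = vault_mount.lz_kv.path
  name      = "naming-conventions/tiers"
  data_json = jsonencode({
    dev  = { short = "dev", long = "development" }
    prod = { short = "prd", long = "production" }
  })
}

# IP allow list entries with a descriptor alongside each address.
resource "vault_kv_secret_v2" "ip_allow_list" {
  namespace = "landing-zone"
  mount     = vault_mount.lz_kv.path
  name      = "network/ip-allow-list"
  data_json = jsonencode({
    "203.0.113.10" = "example office egress"
    "198.51.100.7" = "example monitoring application"
  })
}
```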

»Identifying Namespace Patterns

When we went to do the namespace automation, we realized there are about three patterns we could follow. It breaks down into consumers, producers, and what we just call the Terraform Landing Zone.

Consumers are the types of folks or teams we identified that would want a little bit more help from us to manage their namespace — a lot more assistance working with the Vault admins along the way, maybe getting ramped up as they go. They would probably manage their namespace through the UI, CLI, or API. The differentiator here is that they're probably not going to use Terraform to manage their namespace initially.

Then the other type, like I said, is the producers. These groups usually end up being some of our engineering teams. They're used to Terraform Cloud already, or Terraform in general, and they're interested in using Terraform Cloud to manage their namespace. Once we realized that group exists, we knew they're not going to need too much help from the Vault admins. A Vault admin, of course, still needs to initially deploy their namespace for them. But with these considerations, we knew we were going to need things like the GitHub repo, the Terraform Cloud workspace, all that sort of stuff, so they could charge forward and use Terraform Cloud to manage their namespace by themselves. So you have to think about incorporating those in the module.

The last pattern is the Terraform Landing Zone piece. We have a module in place, and we'd like to make it easy for a new namespace to be deployed in the future. This would be an added service: when some teams deploy a cloud account through the Landing Zone, maybe they can select an option to include an additional namespace when they need it. Right now, most teams are using an existing namespace even when they're deploying a new Landing Zone account — there's enough cross-pollination going on. The situations where they might need an additional namespace with an account that's being deployed might be when they're working with multiple other teams and don't want to manage the complexity of carving up the policies and groups within the existing namespace they have.

Again, greater isolation for them.

All of those influenced the design we went for.

At a very basic level, if you're going to deploy a namespace — sure, there can be some other things here, but this is what we thought about initially — you need to get the Azure AD group access into the namespace, so take care of that piece. You have to get some policy templates in place to help these teams quickly adopt a policy that meets their needs — for the admin of the namespace, for example, give them a policy that's scoped to give them enough access to do what they need, but carve out a few things we don't want them to be able to manipulate. We also include some other things, like default configurations for secrets engines and auth methods, trying to bring in some of our best practices.

»Distributing Access

One thing we like to include, for the KV portion, is pre-populating it with an example of best-practice folder structure. That helps set them up for success with carving out policies in the future as they need to divvy up access across their different groups. We see use cases where, within a namespace, a folder structure works well when it's based on the application and drills down into various aspects of that application, or into tiers, whether it's prod or dev. Then they have group structures where maybe they want only a group of developers to have access to development secrets, and a whole other group of people who can only access production. It helps them carve it up that way.
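
As a small hedged example of that pre-population, with made-up namespace, application, and path names:

```hcl
# Seed an example folder structure of <application>/<tier>/<purpose>, so teams can
# later write policies scoped to app1/dev/* versus app1/prod/*.
resource "vault_kv_secret_v2" "example_dev" {
  namespace = "team-a" # assumes per-resource namespace support in the provider
  mount     = "kv"
  name      = "app1/dev/config"
  data_json = jsonencode({ example = "replace-me" })
}

resource "vault_kv_secret_v2" "example_prod" {
  namespace = "team-a"
  mount     = "kv"
  name      = "app1/prod/config"
  data_json = jsonencode({ example = "replace-me" })
}
```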

Again, that previous slide was a more basic one; this is where we started thinking about what more needed to be added, as we thought about those folks we call the producers, who are going to need the GitHub repo, the Terraform Cloud workspace, all that sort of stuff. After considering all those other needs of the use case, it became clear that automating namespace creation wasn't simply going to be using the Vault provider to create a namespace and calling it done. In our situation, there were many other components we wanted to incorporate to give the producer groups and the Terraform Landing Zone the ability to manage their namespaces with Terraform Cloud, so we go through the whole process here. We started with the simple namespace creation module that's pictured up here, and essentially ended up with three different modules that we use together. We carved out the namespace module, which takes care of adding the OIDC Azure AD SSO integration with the external groups and handles that internal-external group mapping. That module also takes care of the secrets engine piece I mentioned — the policies, and adding the policies to the Vault admin group, so on and so forth.

Then we also built what we call the self-service namespace module. That one is used in conjunction with the namespace module, as mentioned. The self-service module creates the GitHub repo, which is pre-populated with a Terraform template to help them get started with understanding how they'd use Terraform to manage their Vault namespace. It then creates the Terraform Cloud workspace and assigns group access to both GitHub and the Terraform Cloud workspace. It also adds the Terraform Cloud service account to the GitHub repo so it has the access needed for the VCS integration.
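
A hedged sketch of those self-service pieces — repository, team access, workspace, and VCS link — using hypothetical org, team, template, and OAuth-client identifiers rather than the module's real inputs:

```hcl
# Repo pre-populated from a hypothetical starter template of namespace Terraform.
resource "github_repository" "namespace" {
  name       = "vault-namespace-team-a"
  visibility = "private"

  template {
    owner      = "example-org"
    repository = "vault-namespace-starter"
  }
}

# Give the team's GitHub team write access to their new repo.
resource "github_team_repository" "namespace" {
  team_id    = var.github_team_id
  repository = github_repository.namespace.name
  permission = "push"
}

# Terraform Cloud workspace wired to the repo through the VCS (OAuth) connection.
resource "tfe_workspace" "namespace" {
  name         = "vault-namespace-team-a"
  organization = "example-org"

  vcs_repo {
    identifier     = github_repository.namespace.full_name
    oauth_token_id = var.tfc_oauth_token_id
  }
}

# Grant the corresponding Terraform Cloud team access to the workspace.
resource "tfe_team_access" "namespace" {
  access       = "write"
  team_id      = var.tfc_team_id
  workspace_id = tfe_workspace.namespace.id
}
```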

In the flow on the right-hand side here, we show all that being brought together. Another module that we mentioned is the Credentials Alt Save. We're creating that Terraform Cloud workspace that we want the team to be able to use to manage their namespace, but we don't want them to have direct access to the credentials it uses. So we built a module where we create that AppRole role ID and secret ID in their namespace, but only save the credentials into our Vault admin namespace, so they don't have access to them. We're then able to reference those credentials and pull them into the Terraform Cloud workspace. The team that gets access to that Terraform Cloud workspace doesn't get permission to see state or anything like that, so they can't access those credentials. If we need to, we can re-provide the credentials, and we handle the rotation of those credentials for them, so they don't have to worry about that.
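
A hedged sketch of that credential handling — the namespaces, mount paths, role name, and workspace variable are placeholders, and the per-resource `namespace` argument is again assumed:

```hcl
# AppRole secret ID for the team's Terraform Cloud workspace, minted in their namespace.
resource "vault_approle_auth_backend_role_secret_id" "tfc" {
  namespace = "team-a"
  backend   = "approle"
  role_name = "tfc-team-a"
}

# Keep a copy only in the Vault admin namespace; the team never receives it directly.
resource "vault_kv_secret_v2" "tfc_approle" {
  namespace = "vault-admin"
  mount     = "kv"
  name      = "tfc-approle/team-a"
  data_json = jsonencode({
    role_id   = var.tfc_role_id # role_id of the AppRole created by the namespace module
    secret_id = vault_approle_auth_backend_role_secret_id.tfc.secret_id
  })
}

# Inject the credential into the workspace as a sensitive variable the team can use but not read.
resource "tfe_variable" "approle_secret_id" {
  key          = "VAULT_SECRET_ID"
  value        = vault_approle_auth_backend_role_secret_id.tfc.secret_id
  category     = "env"
  sensitive    = true
  workspace_id = var.tfc_workspace_id # placeholder for the workspace created above
}
```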

Just at a real high level, a quick diagram view, because it's a lot of words and pictures are better. For namespace self-service, from our perspective, we get the Terraform file — I'm doing this all from what I call my Vault management workspace and repo — and I'm initially creating their namespace for them there, modifying some variables we have within the Terraform file that's used for this, and bringing in those modules. We're creating that namespace, and the credential that will be used for Terraform Cloud gets created. We move on and create the GitHub repo. We have to create the GitHub repo first, because we then need to create the Terraform Cloud workspace, which needs to reference the GitHub repo. We make that link happen, and then we apply the credentials, the AppRole credentials, to the workspace. And then we're off to the races. The team now has access to their namespace and the ability to manage it through Terraform and Terraform Cloud.

Some lessons we learned along the way, some tips here — probably straightforward, but I'd like to call them out anyway. On policies: when we're talking about giving a policy to what we're calling the namespace admin, we found we should really omit the ability for them to delete a mount path, specifically to avoid them accidentally deleting their whole KV. We do give them permission to create mount paths, but they have to follow a certain path structure initially. We don't want them to go crazy and create as many mount paths as they want — we're trying to put a little control in there. This is also a product of coming from Open Source, where we ended up creating tons of mount points — it got a little excessive. We want to keep that a little more grounded here.
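
For illustration, a hedged sketch of what a namespace-admin policy along those lines might contain — the namespace, paths, and mount prefix are placeholders, and the real policy carves out more than this:

```hcl
resource "vault_policy" "namespace_admin" {
  namespace = "team-a" # assumes per-resource namespace support in the provider
  name      = "namespace-admin"

  policy = <<-EOT
    # Manage secrets inside the namespace's KV.
    path "kv/*" {
      capabilities = ["create", "read", "update", "delete", "list"]
    }

    # Allow creating/tuning new mounts under an agreed prefix,
    # but deliberately omit "delete" so a whole KV can't be removed by accident.
    path "sys/mounts/team-kv-*" {
      capabilities = ["create", "read", "update"]
    }

    path "sys/mounts" {
      capabilities = ["read", "list"]
    }
  EOT
}
```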

Another thing, moving on to the second bullet point here, is to think about really leveraging the `depends_on` meta-argument as you're creating the module. Like this example is showing, where I'm talking about the Terraform Cloud workspace being created: obviously there's no sense in creating it before the GitHub repo is created, along with some content we add within the GitHub repo. So we put a `depends_on` there so that stuff finishes being created first, and then Terraform can move on and deploy the workspace piece.
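
A hedged excerpt building on the self-service sketch above (same placeholder names), showing that explicit ordering: the workspace waits for the repo's starter content before it is created:

```hcl
# Starter Terraform content committed into the new repo (template path is hypothetical).
resource "github_repository_file" "starter" {
  repository     = github_repository.namespace.name
  file           = "main.tf"
  content        = file("${path.module}/templates/namespace-main.tf")
  commit_message = "Add starter Vault namespace configuration"
}

resource "tfe_workspace" "namespace" {
  name         = "vault-namespace-team-a"
  organization = "example-org"

  vcs_repo {
    identifier     = github_repository.namespace.full_name
    oauth_token_id = var.tfc_oauth_token_id
  }

  # Don't create the workspace until the repo and its starter content exist.
  depends_on = [github_repository_file.starter]
}
```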

The final lesson was just — we all come across this as we're working on stuff, banging away — we ran into a situation where we had to version-lock to a specific GitHub provider version, because we were running into trouble when we were trying to do these deployments. We just added a reminder to circle back and check whether the bug gets fixed.
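
That pin is just a normal `required_providers` constraint — the version shown is a placeholder, not the specific release we locked to:

```hcl
terraform {
  required_providers {
    github = {
      source  = "integrations/github"
      version = "= 5.18.0" # placeholder exact pin; revisit once the upstream bug is fixed
    }
  }
}
```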

The last thing here for lessons learned: we're handling creating these namespaces for teams and we're adding the groups in, but we haven't quite opened the loop up enough to let them add groups as they want. That may be a good thing for now, but we'd like to progress on that in the future. We want to let teams grant access to their own namespaces if they need to. Right now, with that hierarchy, you have to drop the AD group into root first and then do all that referencing to get it into the namespace. We're not going to let teams manage anything in the root of Vault, because obviously you wouldn't want to, so we just have to work through that and find the best solution for it. We've really just gotten cruising with this; I'm sure we'll find a better solution in the future and pivot to that.

»Provide Customers With Examples

We find that's helpful — it helps start them down the correct path. Like I mentioned: the KV structure, the policy structures, those sorts of examples, which help them start setting up their Terraform Cloud workspace on their own — all those pieces, all that code that was included for them. We found that got them going quicker and coming back with fewer questions, which was good.

Part of the process is that you're creating namespaces for these teams. Try to identify the teams who have the ability to take on more ownership, identify individuals within those teams to be the admins of those namespaces, and then work directly with them to get them ramped up as needed.

A small point would be training on namespace login. Again, when you come from Open Source, the namespace concept is totally new. Just simple things in the UI: "Hey, when you log in, you're not going to be able to log directly into your namespace — not with the way we've designed it here. You're still going to have to log into root and then pivot into your namespace." So there's that aspect. It was common at the start for teams, even those using the CLI or API, to try to grab credentials and get the error, "Hey, you don't have access to these credentials, they don't exist." That would be because they didn't put the namespace switch in there, so they were just hitting Vault itself. Just one of those small things.

Then the last piece would be that migrating from Open Source to Enterprise is not easy. We're still underway with it — we're getting pretty close to the end of it here — but it's a slog, in my experience. Bringing secrets in KV over from Open Source to Enterprise — there are some scripts out there that can help with that quite a bit, which is great; that's a good starting point. But then think about everything else that's involved — the groups, the apps, any of the auth methods you're using, other secrets engines — and going through getting all of that set up. And then really just getting the attention of all those other stakeholders and helping them migrate their stuff over, or determine the process they want to use to migrate. There's quite a bit there. We're making progress with it as we get attention from other teams to focus on it.

 

Bryan:

I would say these things right here, I think the bottom four — these are people problems. They're not technology problems, they're process problems. Any time you're thinking about adopting namespaces in your organization, these are key things you have to think about. Providing teams with examples — that was the whole thing where I was talking about standardization. We stamp out a KV, we stamp out policies like a namespace admin policy on every namespace, but we also provide examples. That leads into the next two: identifying the namespace admins — who's going to manage the namespace as a whole? Finally, you have to train those people and make sure they understand how to use Vault. If you're going to hand over a namespace to a development team or a namespace admin, they need to know Vault things. They need to know: how do you work with a secrets engine or an auth method?

»People Issues

Or how do you write policies that are going to be applicable to your applications? Things like that. Those are, again, all people problems that you have to solve. They're not necessarily technology problems.

I would say even the bottom one is very much that way. Josh was upgrading from OSS to Enterprise, but — again, it's not a technology thing. We deployed all new clusters for Enterprise, so it's not like we were worried about swapping out binaries and all that kind of stuff. A lot of it was, yes, we're going to migrate secrets from Open Source to Enterprise. But a lot of it is convincing the development teams that they should adopt this new solution: you're already using Vault, you love it, your applications are dependent on it — you need to move over to this new deployment that has complete failover and redundancy across multiple clouds and multiple regions, all that kind of stuff.

So we put together onboarding docs for developers. I was trying to push Josh: "You've got to put a timeline on here. You've got to say, 'We are shutting Vault Open Source down on this date. You must move over to this new solution.'" Because otherwise it becomes a burden on the Vault operations team. Now we've got this cool new deployment, a whole big replica set across multiple clouds and all that kind of stuff, and we can do whatever we want with it. But guess what? We've also got to keep managing Open Source, because the developers won't come off of Open Source and move over to the new stuff. So again, it's not a technology problem — a lot of this is people and process problems that you have to solve.

 

Josh:

Absolutely right. That's a valid point to call out. If we were just migrating straight from Open Source to Enterprise, there is an easy migration path. But what I think made it tough for us is that we're adopting namespaces. We no longer want people to use their traditional way, where policies were all based on the scoping of their paths — it's like team name, and so forth — and that's all in the root of Vault in Open Source. We're saying, "No, we're not going to have anyone live in root anymore. We're going to put it all in namespaces." That, therefore, causes a lot of need to change their…

 

Bryan:

How you access that is totally different.

 

Josh:

It was very good though.

Well, thank you all very much. I also have to thank Bryan and his crew. HashiCorp helped out a lot — actually, folks sitting right here in front worked with us too. All very helpful. And then, of course, the teammates I work with on all the module stuff — another teammate who cruises through our Terraform Landing Zone piece was a big contributor to getting our module flowing there and working well, and then I just Frankensteined some of his parts together, with his approval. So, that's what we got.

Well, thank you very much. We'll actually be out back there, out in the hallway, Bryan and I. So if you want to ask any additional questions, whatever. 

Bryan:

Thanks, everyone.
