Learn how Target manages and maintains its enterprise deployment of HashiCorp Vault, from unattended builds and automated maintenance to compliance and client onboarding.
This is the story of how Target adopted HashiCorp Vault for centralized secrets management across its teams.
Hello and welcome to "Managing Target's Enterprise Secrets Platform." I'm your host, Shane Petrich. I'm part of the digital certificates and cryptography services (DCCS) team within the Cybersecurity Solutions Group at Target.
I've been at Target for 9 years. The last 6 of them have been with the cybersecurity group. Target's made up of a little over 1,900 stores. We have 46 distribution centers, and we're headquartered in Minneapolis, Minnesota. We also have a global capability center in Bangalore, India.
A little background on where Target was. About 2013, Target was embracing DevOps and agile methodology, moving away from waterfall styles of managing our platforms. Teams were automating their processes and their platforms and removing manual intervention, which was introducing too many problems. We started to expand into cloud and introduced new foundational technologies and new ways of doing work.
Our teams were building these environments and coding their applications in new ways and having to store their secrets in different locations. Some of them were putting them in their Git repos. Some of them were putting them in their CI/CD pipeline.
Target Tech noticed that this was a problem and wanted to centralize our secrets management. The leadership approached the DCCS team, my team, to solve this need across the enterprise and develop something not only for the new way of doing things [greenfield] with DevOps and agile methodology, but also for the legacy work that we still had all around the environment [brownfield].
With projects like this, we start off with requirements. We're part of the cybersecurity team, so we naturally gravitated to the security requirements first.
We wanted to enforce least-privilege access. We wanted a centralized authentication authority that we could tie into, and we needed to keep an audit log. We needed to monitor all the activity and get notified if there was a problem.
We had to be stable. We couldn't have a platform that was going to be going up and down, unavailable one time and available another time. We had to make sure that it was always there. The idea was to minimize complexity; the fewer things that you have, the fewer things that are going to break. We wanted to make sure that that was a requirement of this project. Availability went right along with that. We wanted to make sure that we were available even when we were doing maintenance.
If there was ever an outage somewhere, we had to make sure that we could still handle the entire workload for all of our customers at a different location.
Lastly, we had to be compliant. We have PCI requirements, Sarbanes-Oxley requirements, HIPAA, etc. All of our teams are required to meet these requirements. We didn't know what they were going to store in our environments, so we wanted to make sure that we could cover all those needs.
It all boils down to: We need to keep this simple. We need to automate this. We need to make it so that it's not difficult for the company's engineering teams to consume and our own group's engineering team to support.
After we settled on the requirements by working with all of our business partners, we got to work on the buildout.
We started off asking, "How are we going to build this?" We looked around. At that time, we could only build on a compute node. So we went to our partners in infrastructure and started procuring different servers and things like that. We wanted to make sure we were utilizing standard offerings. We didn't want to start blazing the trail, because this had to be something that was available to everyone.
We also wanted to make sure that we had all this stuff automated. At the time we were looking at just a few nodes, but we knew that we would grow. It's easier to build and script on 1 node and then replicate that to everybody else versus trying to do a manual process on every single node. That's going to have some problems.
We had to build this stuff very quickly, so we had to fail fast. We would approach a problem and work on it. If it just wasn't going to happen, we'd have to pivot and move another way. For example, we couldn't use Terraform's Vault plugin because of deficiencies it had at the time. So we pivoted away from it and built up our own orchestration to handle everything that the plugin was going to handle for us.
After building this platform and this service, we started taking on customers. We started off with those customers that we were engaged with on building the requirements at the very start. We would bring them in and manually create their access and their policies so that we could learn what the process was, but also so they could start consuming it.
As we were working through those onboardings, we started to realize that these are just repeatable patterns. We're doing LDAP and secrets key-value. And we realized all this can be automated. So we developed an automation process. We accept requests from our customers and use our CI/CD pipeline to test and execute the onboarding. A user comes in and tells us what they want for their policy, and we test it out, make sure all the data is good and that the groups are created, and do all those validations. Then, after we've reviewed it, we publish it. Our CI/CD executes that job and adds that policy and that space to our platform. Then teams can start working and consuming that service.
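The validation step in a pipeline like this can be sketched as a small check routine that runs before anything is published. Everything below is illustrative: the request fields, the group-name pattern, and the path rules are assumptions for the sketch, not Target's actual schema.

```python
import re

# Illustrative naming rules; real rules would come from the platform team.
GROUP_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")
PATH_PATTERN = re.compile(r"^[a-z0-9][a-z0-9/_-]*$")

def validate_onboarding(request):
    """Return a list of validation errors; an empty list means the request is good."""
    errors = []
    if not request.get("team"):
        errors.append("missing team name")
    for group in request.get("ldap_groups", []):
        if not GROUP_PATTERN.match(group):
            errors.append(f"invalid LDAP group name: {group!r}")
    path = request.get("secrets_path", "")
    if not PATH_PATTERN.match(path):
        errors.append(f"invalid secrets path: {path!r}")
    return errors

# Hypothetical onboarding request, as a team might submit it.
request = {
    "team": "checkout",
    "ldap_groups": ["app-checkout-rw"],
    "secrets_path": "secret/checkout",
}
print(validate_onboarding(request))  # []
```

In a CI/CD pipeline, a non-empty error list would fail the job before the review and publish steps ever run.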
The automation was great, but we realized we needed to help people. We can't put something out there and just have it work. We spun up a chat channel within our internal Slack and invited people to it, saying, "If you have any questions, you can engage with an engineer." As we got more into things, we spun up an Open Lab. Now we have a couple of hours every week when we invite anybody and everybody to have a one-on-one talk with one of the engineers that support this platform.
We also developed a full document site for those that don't need a close partnership. They can go right to the documentation and read about how to onboard, what's required of them, things like that. We also give them examples: "Here's how you consume Vault if you have the Vault binary." "Here's how you consume Vault if you have curl or if you're just hitting the API." There are several more that are commonly used throughout Target. We make that publicly available. Hopefully people can find the answer, but if not, we're always here to help out afterwards.
As we were continuing to grow the platform, we found new things that we wanted to include in our automation. We were finding more teams that have regulatory requirements, and we learned that the compliance team needs to review those teams' requests for a secrets location.
We modified our onboarding to check whether the application needs that second level of approval. If the application that you're using secrets for falls under regulatory rules, the compliance team gets engaged, and the onboarding is paused. If the compliance team approves, the onboarding proceeds, and there you go, you have your location.
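That approval gate is simple to reason about as a small state decision. This is a minimal sketch under assumed names: the regulatory scopes and the state strings are made up for illustration, not Target's actual workflow.

```python
# Hypothetical regulatory scopes that trigger a second level of approval.
REGULATED_SCOPES = {"pci", "sox", "hipaa"}

def onboarding_state(request, compliance_approved=None):
    """Decide whether an onboarding proceeds, pauses, or is rejected.

    compliance_approved is None until the compliance team has weighed in.
    """
    needs_review = bool(REGULATED_SCOPES & set(request.get("scopes", [])))
    if not needs_review:
        return "approved"   # no regulatory data involved: standard path
    if compliance_approved is None:
        return "paused"     # waiting on the compliance team
    return "approved" if compliance_approved else "rejected"

print(onboarding_state({"scopes": []}))                                 # approved
print(onboarding_state({"scopes": ["pci"]}))                            # paused
print(onboarding_state({"scopes": ["pci"]}, compliance_approved=True))  # approved
```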
Now that we have teams that rely on us and use our platform to build their services, we need to make sure that we're always available. We're building up our monitoring. We start off with the basics. Are the servers available? Are they up and running? Can they ping? Are the CPUs overutilized? Have we used up all the hard drive space? Is Vault running? Is Consul running? Are the needed processes actually there? We start off with that base monitoring.
Then we move into, What does the application give us? Vault has metrics in it. How long did the login take? Which node is active in your cluster? Which nodes are sealed? Those are metrics that you can then harvest and display on a dashboard. Consul has metrics too. Which node is the cluster leader? What are your read/write IOPS? Those are important things to be looking at to make sure that your platform is always running successfully and stably.
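Both Vault and Consul can expose their telemetry in Prometheus text format (Vault via `/v1/sys/metrics?format=prometheus`), and harvesting a handful of values for a dashboard is simple string work. A minimal sketch of that parsing, with a made-up sample payload (the metric names and values are illustrative):

```python
def parse_metrics(payload):
    """Parse a simple Prometheus-text-format payload into {metric_name: value}.

    This sketch ignores labels and histogram buckets for brevity.
    """
    metrics = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # skip lines that don't end in a numeric sample
    return metrics

# Illustrative sample of what a metrics scrape might contain.
sample = """\
# TYPE vault_core_unsealed gauge
vault_core_unsealed 1
# TYPE vault_token_lookup_count counter
vault_token_lookup_count 5481
"""
print(parse_metrics(sample)["vault_core_unsealed"])  # 1.0
```

A real collector would scrape this endpoint on a schedule and ship the values to whatever dashboard and alerting stack the team runs.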
Next, it was learning from events. All of our monitoring at that time had nodes telling us that the platform was OK, telling us what the platform already knows. You need that, but sometimes a user cannot get to your platform even though, from the inside, everything looks up. So we developed synthetic testing. We have remote nodes in different locations that execute activities against our Vault clusters. They tell us whether it was successful or failed. How long did it take? Did it take too long? What part of the process failed? Was it at login? Was it at write? Was it at read? We get these kinds of metrics to truly tell us what the requester is experiencing. We can trend that, watch for it, and alert on it if it goes bad.
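One way to structure a probe like that is a small harness that times each step and records which step failed. This is a sketch under stated assumptions: the real steps would be actual Vault calls (login, write, read); here they are stubbed with no-op functions.

```python
import time

def timed_step(name, fn):
    """Run one probe step, returning (name, succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        fn()
        ok = True
    except Exception:
        ok = False
    return (name, ok, time.monotonic() - start)

def run_probe(steps):
    """Run login/write/read-style steps in order, stopping at the first failure."""
    results = []
    for name, fn in steps:
        result = timed_step(name, fn)
        results.append(result)
        if not result[1]:
            break  # the failed step tells us *where* the experience broke
    return results

# Stub steps standing in for real Vault calls from a remote probe node.
steps = [
    ("login", lambda: None),
    ("write", lambda: None),
    ("read",  lambda: None),
]
for name, ok, elapsed in run_probe(steps):
    print(f"{name}: {'ok' if ok else 'FAILED'} in {elapsed:.3f}s")
```

Shipping the per-step timings to the same metrics pipeline as the server-side telemetry lets you trend and alert on the requester's experience, not just the platform's self-view.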
We have also learned from our root-cause analysis and from other teams' root-cause analysis. As everybody experiences outages, we learn from them; we analyze and see what happens. We also learn and analyze from other people's incidents. So if a foundational service fails but your service stays running well, why is that? Maybe those learnings could then be shared with other teams.
Once you've got your platform up and running, you've got it monitored, you've got onboarding that's automated, you've got all this stuff that's working, you've got maintenance. Everybody has to make sure that stuff is kept up to date. How do you deploy that new version of Vault? That new version of Consul? That new version of whatever other software you might have running?
You need to have this in mind back at the build stage. Remember, you're going to have to do this eventually. How do you upgrade? Are you going to just update the config or do a teardown and buildup?
You have to work through those processes. Try and make it as easy as possible, because you're going to do it a lot of times. And not only are you going to do it for the application, but if you're like us and you're running on servers, you have to patch those servers.
We have to do certificate management. My team is responsible for SSL certificates at Target, so certificate management is near and dear to our hearts. We want to make sure that the rotation of those certificates is not only automated, but seamless to the user. A customer should not have an outage just because we're doing maintenance like this. A lot of this comes down to how you have your platform built. Can you take down certain parts of it in a structured way and not impact your users?
Lastly and most importantly is maintenance for your team. Our platform is successful because of the team that we have behind it. So the care for those team members is paramount: engaging them, getting them excited. Things like HashiConf. Things like your local HashiCorp user groups, where you can learn new things, learn a different way of doing something, or just bounce ideas off of somebody. They're invaluable. Online resources, YouTube. Everything that we have is on YouTube.
Customers themselves have pushed us. We have very smart people and they ask us, "Can it do this? Can it do that? What about this over here?" And we say, "Those are great ideas. Let's partner together. Let's work on that and see where we can go with this." Or an engineer might have an idea and just explore, maybe asking, "What does this new feature in Vault do?" Or maybe they will develop a custom plugin. Fostering that and giving them the time really pays dividends on your platform.
Thank you for your time. I appreciate it. Have a great day.