Case Study

Scalable secret storage with Vault at athenahealth

Find out how athenahealth replaced 'build' with 'buy' for their secrets management security platform.

»Transcript

Thank you everyone. I'm Daniel Leach from athenahealth. At athenahealth, we're responsible for medical records management. We offer a suite of online-based applications for medical organizations focusing on electronic medical records, check-in systems, billing, patient portal, Telecare. Basically, the entire suite of applications that you might use in a patient experience — or a doctor's experience — in any kind of medical practice. Typical customers of ours are going to be primary care providers, specialists, and urgent care facilities.

»The challenge of managing medical records 

Managing medical records is quite a challenge. You've got all the regulatory requirements, the requirements of individual doctors and practices, and the general security and trust that has to exist between a patient and their practitioner.

So, imagine how difficult secrets management can be on top of that for our application. We've got to make sure our secrets meet the business needs and the regulatory requirements, and that those secrets are absolutely secure. This is a common story, I believe: you have service accounts, application secrets, keys, TLS certs, emergency access accounts — and you configure them all automatically in some fashion. I hope so, at least.

You're using configuration management like Puppet or Ansible to deploy those. You're loading SQL scripts into your databases, running bash scripts, Python, PowerShell scripts — all kinds of tooling to ensure your secrets are deployed, up to date, and stored on your systems. You're leveraging tools like Terraform and CloudFormation for your cloud secrets management as well. Or you're like us, and you use all of those — and you probably have separate teams managing service-specific secrets deployments as well.

So, imagine the situation where you need to rotate a secret — a credential for a web application talking to a database. That means you may have to open up tickets with two, maybe three teams — if it involves a cloud provider as well.

You've got to worry about the application code, the database users table, load balancers — and the new secret has to be distributed to all of them at the same time to ensure you've got minimal downtime when it's rotated.

»Athenahealth’s custom-built secrets management solution

This was our exact situation. Athenahealth had a custom-built solution for managing all those secrets at scale. The problem was that it was purpose-built for our core application, which meant it wasn't relevant for infrastructure code or for our secondary and tertiary product suites. It was massive and complicated, requiring physical servers and frequent maintenance. In addition, we had a dedicated development team just for this service.

Of course, as I mentioned, the tickets: there was a constant barrage of them — to maintain secrets, keep them up to date, rotate them, and track them, and to support our developers in ensuring secrets could be rotated within an appropriate time period.

Downtimes were frequent — both planned and unplanned — and again, none of this applied to our infrastructure secrets. We had to have a totally separate system for maintaining infrastructure secrets, or — the horror story — have them right in the raw code. Thankfully encrypted and hashed, not plain text, but still right there within our code.

To rotate that, it's a multi-step process: change the code, open a pull request, get it peer-reviewed, make sure it passes all the tests, deploy it out. Oops, it broke. Didn't work. Let's figure this out. Change the secret again. Finally, it's working. How many steps does that take? A lot. Oh, and you had to have somebody with the right access to author that code, check it, and then deploy it out to the systems as well.

»The search for a unified solution 

This is when our developers and our site reliability engineers got together to start investigating a unified solution. It needed to be scalable, reduce complexity, and be easy to operate — and have some form of self-service capability to allow developers to maintain their secrets on their own.

Vault was one of the clear choices for that: one tool, small HA clusters. And the best part is the self-service capability: it removes the barrage of service tickets constantly being opened with our engineering teams.

»Vault at athenahealth

This is obviously pretty simplified, but each of our datacenters has its own cluster of Vault application servers and also a dedicated Consul cluster for backend storage. You'll also note from the two arrows at the top that there are additional Consul clusters for service discovery. By provisioning a wide net of Vault application servers on top and leveraging Consul service discovery, we can aggregate everything together for availability and easy access for our application systems.
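As a rough illustration of that earlier architecture, one Vault server's configuration might have looked something like the following sketch. The addresses, paths, and hostnames are placeholders for the example, not athenahealth's actual settings.

# Illustrative Vault server config for the earlier architecture:
# a dedicated Consul cluster for backend storage, plus registration
# into a separate Consul cluster used for service discovery.

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault.d/tls/vault.crt"
  tls_key_file  = "/etc/vault.d/tls/vault.key"
}

# Backend storage on the dedicated Consul cluster (local agent assumed)
storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

# Register this node as part of the "vault" service in the
# service-discovery Consul cluster so clients find the cluster by name
service_registration "consul" {
  address = "consul-sd.dc1.example.internal:8500"
  service = "vault"
}

api_addr     = "https://vault-dc1-1.example.internal:8200"
cluster_addr = "https://vault-dc1-1.example.internal:8201"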

This allows for easier user interaction and, most importantly, low-latency access for our applications, by ensuring that we've got clusters in each of our datacenters. We also have additional clusters in each datacenter for disaster recovery. This is local disaster recovery, and utilizing it requires manual intervention by our team to trigger that failover. And that's purposeful: we want to make sure that we don't leave any opportunity for a split-brain scenario.

In fact, that's not our design — that's actually HashiCorp's design to prevent that issue. If you have a split brain, you don't want leader election to happen automatically. You want to use this cluster with intent, if you really need it. We actually use the DR cluster when we're doing major cluster upgrades: we'll flip over to the local DR cluster, upgrade the existing cluster, flip back again, and then upgrade the DR cluster. I'll talk about upgrades a little bit later, too.

Speaking of failover, if the local cluster goes down, we've configured our internal applications to fall back to a secondary datacenter. This automatic mechanism doesn't come with Vault natively right now; it's something we built for our own applications, defined both in our application stack code and in the Vault Agent configuration. I'll talk about that a little bit more later, too.

But we have a secondary capability to fall back to a totally different datacenter. In our own testing, and in real use cases during various localized outages, this has been a seamless transition — our clients simply start talking to a secondary datacenter. It introduces a bit of latency, especially because we're talking to a different region, but it's been non-impactful for our users. They don't even notice it, to be honest. More often than not, if we've got a separate localized outage, they're more concerned about that. The last thing I want them worried about is the ability to fetch secrets.

»Leveraging internal Raft storage provided by Vault

In earlier slides, I discussed how we're leveraging Consul as our backend storage. Guess what? That's old; we're not doing that anymore. I finished this slide deck early last week, and since then we have actually torn out that backend Consul storage for most of our clusters: we've only got one left.

We're now leveraging the internal Raft storage provided by Vault. That's been the newly recommended approach as of — I want to say — almost a year ago at this point. Based on the new recommendation, we said let's give it a shot, because if we can remove the backend Consul storage, that reduces our complexity significantly.
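As a sketch of what that change looks like in a server's configuration, the Consul storage stanza gets replaced with Raft integrated storage along these lines; the data path, node IDs, and peer addresses below are placeholders.

# Illustrative integrated storage config: local Raft data replaces the
# external Consul backend, and peers find each other via retry_join.
storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-dc1-1"

  retry_join {
    leader_api_addr = "https://vault-dc1-2.example.internal:8200"
  }
  retry_join {
    leader_api_addr = "https://vault-dc1-3.example.internal:8200"
  }
}

api_addr     = "https://vault-dc1-1.example.internal:8200"
cluster_addr = "https://vault-dc1-1.example.internal:8201"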

We were running a hundred servers total between the Vault Enterprise servers and the Consul servers for backend storage. Cut out all the Consul, and we're down to 50 — almost. We're not done yet. Knock on wood that it keeps going well.

Reducing that complexity significantly helps with the management experience, but it has also had an impact on our performance and scale capabilities. You'd think that removing the dedicated Consul cluster would probably have had a negative impact. Customer-wise, for our applications and developers: zero change in observed latency or performance.

And from the management perspective — the Vault systems' core resource utilization, whether it's CPU or RAM — that has decreased, because we no longer have to dedicate additional CPU cycles and resource allocations to the Consul storage backend. That gives us much better capacity for scaling when we need to.

»How do we connect? 

Now, talking about client access, how do we actually connect? Most popular, I think, based on what I've been reading out there, is Vault Agent — and it's typically what I would recommend as well. The agent is a sidecar daemon that runs on your client servers, and it establishes what specific secrets that server has access to. 

You've got a couple of authentication methods. We're leveraging the AppRole authentication type. That gives each server specific access to a dedicated namespace or subzones within that namespace. You can get really granular about what secrets that particular system has access to. 

This agent creates a local HTTP API, which runs right on the localhost. If you want to do a quick implementation of Vault, you don't have to retool all of your application code to handle talking to the native API. You can leverage the local API much more easily because you've scoped down your access right there at the server level. It gets you up and running a little faster, I think.
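A minimal sketch of a Vault Agent configuration along those lines is shown below, assuming AppRole credentials delivered as local files and a plain-HTTP localhost listener; the file paths, port, and Vault address are placeholders, not athenahealth's actual configuration.

# Illustrative Vault Agent config: AppRole auto-auth scoped to this
# server, response caching, and a localhost API for the application.
pid_file = "/var/run/vault-agent.pid"

vault {
  # Consul service-discovery address for the local datacenter's cluster
  address = "https://vault.service.dc1.consul:8200"
}

auto_auth {
  method "approle" {
    config = {
      role_id_file_path   = "/etc/vault-agent/role_id"
      secret_id_file_path = "/etc/vault-agent/secret_id"
    }
  }
  sink "file" {
    config = {
      path = "/etc/vault-agent/token"
    }
  }
}

# Cache responses locally so frequent reads don't hammer the servers
cache {
  use_auto_auth_token = true
}

# The local API the application talks to instead of the remote cluster
listener "tcp" {
  address     = "127.0.0.1:8100"
  tls_disable = true
}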

Then, as I also mentioned, you can leverage the server API as well — have your application stacks talk directly to the Vault API. You can kind of see from the example URLs at the bottom there — on the right-hand side — we're leveraging Consul for our service discovery. 

We point our API-using systems to the Consul service address for Vault itself. That keeps the URL management easy. You don't have to find individual servers. You just say, talk to Vault in my location. Done. Let Consul handle what the write API address is or what the read API address is. It figures it out for you — keeps it easy for our developers.

»Why would you use the agent versus the native server API? 

It comes down to your application's needs. As I mentioned, for easy access, easy access management, and just getting up and going quickly, the agent makes a lot of sense. The agent also handles some caching. So if you're doing frequent requests for secrets especially, use the agent — that's going to have better performance implications for you. You're not hammering your server APIs as much.

But let's say you're not doing frequent secret requests, but you need those requests fulfilled very quickly — like sub-millisecond style, for whatever reason — depending on your application. Then we find the server API is actually a little bit better. More performant in that case. You'll get your requests served faster. Again, it depends on your particular use case. We use both. It all depends on your particular application needs.

Pretty much every major configuration management solution has a function or provider for utilizing Vault as well. We're looking at Puppet here. Obviously, we're a Puppet shop at athenahealth; this is now the second example I've used for it.

You can easily implement this particular example today; it's linked to the Puppet Forge module, and that's where the QR code will take you. This particular example utilizes a Puppet function, which interacts with a locally installed Vault agent on the target node and fetches the desired user password. Pretty simple.

»Using Terraform 

This wouldn't be a HashiCorp presentation if I didn't have a Terraform example, would it? You'll note here that we are not reliant on the Vault agent for this — instead, we're targeting the Vault server address directly.

Tokens and access specifics can be defined as environment variables associated with your Terraform deploy. Here, we've set the Vault connection details at the top in the provider stanza, fetched the desired secret from the defined path as a data source, and then injected the captured credential into the provider block for the desired service configuration. In this case, Rundeck.
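The sketch below shows that pattern with placeholder values: a hypothetical KV path, the Consul service address for Vault, and the community Rundeck provider. It is illustrative rather than the exact configuration from the slide.

provider "vault" {
  # Consul service-discovery address for Vault; the token is supplied
  # through the VAULT_TOKEN environment variable rather than hardcoded.
  address = "https://vault.service.dc1.consul:8200"
}

# Fetch the credential from a (hypothetical) KV path at plan/apply time
data "vault_generic_secret" "rundeck" {
  path = "secret/ops/rundeck"
}

# Inject the captured credential into the downstream provider block
provider "rundeck" {
  url        = "https://rundeck.example.internal"
  auth_token = data.vault_generic_secret.rundeck.data["auth_token"]
}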

»A typical operation from the developer’s perspective  

Now that you've got your infrastructure wired up and you're fetching your secrets from Vault, what does a typical operation look like from the developer's perspective? It should be this from now on — just your Vault UI. 

Let's say you want to deploy some new SSH keys. The owning team for the service can just log into the Vault UI themselves and update the keys. No ticket required. Self-service. If you're using configuration management like Terraform, Ansible, or Puppet — since the secret has now been updated in Vault, the next time those runs execute and source your secrets, just like in the previous Terraform and Puppet examples — those keys will automatically be deployed.

This saves so much time and removes the backlog of work — what may have taken one to two weeks while you wait for the subject matter expert to come back from PTO. Let the developers do their developing and move quickly.

We limit the scope of the secrets as well, leveraging role-based access control. That is a key feature in Vault Enterprise. We really make sure that the scope of management for developers is purposeful — that it's limited to specifically the secrets they need.
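As a simple illustration of that scoping, a Vault policy attached to one team's role might look like the following; the mount and path names are invented for this example.

# Illustrative policy: the owning team can manage only its own KV subtree
path "secret/data/checkin-service/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}

# Listing and cleanup of metadata is limited to the same subtree
path "secret/metadata/checkin-service/*" {
  capabilities = ["read", "list", "delete"]
}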

»What's the operational hit? 

That top number — 90 million secrets transactions per day. That's what we're currently doing as of last week, and as you can glean from the bottom graph, it's increasing. Anyone with a magnifying glass can see on the Y axis there that I'm probably only peaking at about 60 million. That's one location and one cluster. All of our clusters combined meet that 90 million.

We've got over 2,000 developers at athenahealth, and they're all hammering away at our applications and frequently filing requests. Oh, no longer frequently filing requests for updating secrets. They do that on their own now.

How many engineers are responsible for doing this? Not even three. One of them is sitting right here. Yes, three. I'm going to embarrass you a little bit: he's the guy who's actually handling our re-architecture and our agent and server upgrades right now.

We get to share the responsibility of maintaining these clusters alongside plenty of other SRE-related duties here at athenahealth. But how little time we have to dedicate to this, and how small we can keep the pool of dedicated subject matter experts, is wonderful, trust me.

»What's next on our journey? 

Always version upgrades. We're in the middle of one right now with the re-architecture to remove the Consul storage, but those have traditionally been relatively painless. I know I've jinxed myself because we haven't totally finished the new architecture yet. I don't see any actual real wood here to knock on, but I'll pretend there is. Fingers crossed that the final upgrade goes well. But you want to make sure that you're doing frequent upgrades. We find that we do it probably about twice a year. It helps keep you in active support with HashiCorp — and keeps them happy, too.

Obviously, I'm very excited about our reduced complexity and the transition to the internal storage — removing that operational headache. And this is the best practice as documented by HashiCorp as well. You want to make sure that you're following industry best practices and standards. I find that always useful, regardless of your scale.
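For reference, that kind of storage migration can be driven by Vault's operator migrate command with a small configuration file; the sketch below is illustrative only, with placeholder paths and addresses rather than athenahealth's actual setup.

# Illustrative config for `vault operator migrate`, run while the node is
# offline, copying data from the Consul backend into Raft integrated storage.
storage_source "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

storage_destination "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-dc1-1"
}

# Required when the destination is Raft integrated storage
cluster_addr = "https://vault-dc1-1.example.internal:8201"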

»New features we're very excited about 

In this morning's keynote, secrets sync to cloud services was discussed. In particular, I'm very excited about being able to leverage secrets sync to AWS — to AWS Secrets Manager.

It's currently in beta, so we're not utilizing it yet. We've got a big footprint in AWS, and my developers keep asking, how can we do this in Secrets Manager and not use Vault? And I've always been saying, well, we can put Vault in AWS — and do it that way. That's on our roadmap. We want to do that, too. All four services there. But leveraging secrets sync is easily going to bridge the gap.

Then, from the previous session: if folks were at the talk just before mine, you heard about certificate renewals and leveraging the PKI engine. That's on my roadmap, too. I'm so excited to be able to do that: anything we can do to automate certificate management is very high on my priorities.

Thank you, everyone. We've got a case study we published with HashiCorp. It's a little bit old, and I do have to update it. I think it was last published about a year and a half ago, when we were first starting our journey. But give it a good read. It gives a deeper dive into the conversation I had here today. Thank you for attending, and I'll be down here for questions.
