Keep It Safe, Keep It Secret: Vault and Consul at Kickstarter
Jun 04, 2019
Learn how Kickstarter uses Vault and Consul to manage secrets securely with high availability.
Kickstarter is a global funding platform for artists, musicians, filmmakers, designers and other creators. Every 5 seconds someone in the world backs a Kickstarter project so the site has to be available 24x7. To date more than 16 million people have pledged money to support ideas and the site has processed over $4.3 billion in pledges to projects. Conducting transactions with that frequency and at these volumes makes security absolutely paramount at every stage in the development and production cycle.
In this talk, Kickstarter Cloud Ops engineer Natacha Springer will explain how Vault’s PKI backend is deployed by Kickstarter to ensure data is encrypted during transit and how both Vault and Consul integrate seamlessly with the company’s wider IT infrastructure.
In addition, you’ll gain an understanding of how both HashiCorp products keep Kickstarter’s lean and distributed Ops and Engineering teams fully agile while maintaining high security standards and high availability, allowing the company to scale its IT infrastructure and app development as it continues to grow.
Remote Cloud Operational Engineer, Kickstarter
Keep it safe, keep it secret. You could think that would be a Vault talk but the engineering team really wanted me to give it this name. It is a reference to a quote from the movie The Lord of the Rings. My talk will give you an overview on how we are using Vault and Consul at Kickstarter.
My name is Natacha Springer. I am currently a remote cloud operations engineer at Kickstarter and previously I was working at Dow Jones as a DevOps engineer where I started in the internal tools team and then moved on to building the tools to deploy the Wall Street Journal.
» Introducing Kickstarter
For those who don't know Kickstarter, it is a funding platform for creative projects. Somebody has an idea for a comic book, a toy, an art festival, an open source project, a video game and they come to Kickstarter. They put a project up on the site. They can offer rewards for various pledge levels and then their family, their friends, strangers on the internet come and give them money. At the end of the deadline, if they've reached their funding goal we process the transaction and the creator gets the funds to start their project.
To give you an idea of the scale we are working with, every 5 seconds someone backs a Kickstarter project somewhere in the world. Close to 13 million people have pledged towards new creative ideas so far, and less than a month ago we crossed a total of $3 billion in pledges to independent creators on Kickstarter. These stats were correct at the time the presentation was given, but are rising all the time; the most recent ones can be found here.
Another surprising fact about Kickstarter is that our engineering team is pretty small. We just counted, and I think we are 24 engineers with 6 managers. Our engineering team is truly agile and also very diverse, as you can see. The Ops team consists of 3 engineers and a manager, and each of us works from a different state. We're completely distributed.
» Managing secrets
We're moving from an 8-year-old Ruby on Rails application to a microservices-oriented architecture. As we started breaking up the monolith, we quickly realized we also needed to build the tools and the systems around a service-oriented architecture. We followed the 12-factor app methodology and we already had a strict separation of our configs from our code. That was easy.
We started looking for solutions to manage all the secrets and configs, and we simply loved the fact that Vault, a security tool developed by HashiCorp to do exactly this (managing secrets), is entirely open source. As a security tool, it truly benefits from the developer community's scrutiny.
As we started setting up Vault, we realized it was important for it to be completely resilient and scalable. Our initial goal was to set up a stateless Vault cluster. We are also big fans of chaos engineering, and killing a Vault instance should not be a big deal. Our Vault infrastructure puts an Elastic Load Balancer (ELB) in front of an Auto Scaling group of instances. After looking at options for Vault storage backends, we decided to pick Consul.
When running in high-availability mode, as we are, Vault servers have 2 states. They can be either active or standby. For multiple Vault servers sharing one storage backend, only a single instance will be active at any time while all the other ones are hot standbys. The trick to setting up an ELB in front of Vault is to use the /v1/sys/health route. This route will return a 200 response code only if your Vault instance is active and unsealed. Meanwhile, an instance that is sealed would return a 503. With this ELB health check, a client request is always forwarded to the active Vault instance. As you can see here, we're running a cluster of 3 instances and only one is active at any time.
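The health check logic described above can be sketched in a few lines of Python. This is only an illustration: the status codes are the defaults Vault documents for /v1/sys/health, while the function names and the "only 200 stays in rotation" rule are assumptions modeled on the ELB behavior described in the talk.

```python
# Default status codes returned by Vault's /v1/sys/health endpoint.
HEALTH_STATES = {
    200: "active",         # initialized, unsealed, and active
    429: "standby",        # unsealed, but a hot standby
    501: "uninitialized",  # Vault has not been initialized yet
    503: "sealed",         # sealed, cannot serve requests
}


def vault_state(status_code: int) -> str:
    """Translate a /v1/sys/health status code into a readable state."""
    return HEALTH_STATES.get(status_code, "unknown")


def in_service(status_code: int) -> bool:
    """Mimic the load balancer check: only a 200 keeps an instance in rotation."""
    return status_code == 200


if __name__ == "__main__":
    for code in (200, 429, 503):
        print(code, vault_state(code), "in service" if in_service(code) else "out of service")
```

Because standbys and sealed instances both fail the check, the load balancer ends up with exactly one healthy target: the active node.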
» Controlling access
Currently, if our Vault cluster auto-scales, an Ops engineer has to manually unseal Vault, and we get an alert to do so. This is on purpose. We chose not to automate the unsealing process as we feel that it is the best way to keep our secret infrastructure fully secured. Our entire Vault infrastructure is defined in a single CloudFormation template that we keep in GitHub.
A nifty tool that we built is a Vault client script for engineers to access secrets inside Vault. Once the script runs, a Docker container is built that will establish an SSH connection with Vault. This script leverages the GitHub auth method offered by Vault. It authenticates users based on whether or not they are members of a private GitHub team that we have set up. This way, all our engineers have seamless access to Vault and do not need to worry about a token expiring or about requesting access to a particular application’s secrets.
Within the company, we have set some standards on how an engineer should store secrets, and we have communicated and published these internally on our Wiki. We all use the same path structure based on service name and environment, which makes it easier to automate secret retrieval. This path structure also makes it easy to create proper access control lists for each application. We also support a global path for each service to avoid any redundancy in the data entry process.
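To make the path convention concrete, here is a minimal Python sketch. The talk does not show the real layout, so the `secret/<service>/<environment>` and `secret/<service>/global` paths, the function names, and the example service are all hypothetical; only the policy stanza syntax is standard Vault policy HCL.

```python
from typing import List


def secret_paths(service: str, environment: str, mount: str = "secret") -> List[str]:
    """Return the per-service global path, then the environment-specific path."""
    # Hypothetical layout: <mount>/<service>/global and <mount>/<service>/<environment>.
    return [f"{mount}/{service}/global", f"{mount}/{service}/{environment}"]


def read_policy(service: str, environment: str, mount: str = "secret") -> str:
    """Render a read-only Vault policy covering both paths for one service."""
    stanzas = [
        f'path "{path}" {{\n  capabilities = ["read"]\n}}'
        for path in secret_paths(service, environment, mount)
    ]
    return "\n\n".join(stanzas) + "\n"


if __name__ == "__main__":
    print(read_policy("payments", "production"))
```

Since every service follows the same layout, the access control list for a new service can be generated rather than written by hand.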
Storing our secrets securely was important, but we needed to make sure that the data would be encrypted during transit as well. So far the best method out there for transit encryption is TLS. We are a small team and we do not have an engineer completely dedicated to security or to maintaining an entire PKI.
One of Vault's best kept secrets—and maybe its most underrated feature—is its PKI backend. I think most of you will agree with me, encryption is hard, but Vault actually makes it pretty easy.
It acts as a certificate authority and issues and maintains certificates for all our servers—allowing us to enable TLS traffic encryption within our VPC.
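As a rough illustration of what asking Vault for a certificate involves, the Python sketch below builds (but does not send) a request to the PKI secrets engine's issue endpoint. The mount name `pki`, the role name, the hostnames, and the TTL are assumptions made for the example; the talk does not describe Kickstarter's actual roles or tooling.

```python
import json
import urllib.request


def build_issue_request(vault_addr: str, token: str, role: str,
                        common_name: str, ttl: str = "72h",
                        mount: str = "pki") -> urllib.request.Request:
    """Build a POST to /v1/<mount>/issue/<role> asking Vault for a new certificate."""
    body = json.dumps({"common_name": common_name, "ttl": ttl}).encode("utf-8")
    return urllib.request.Request(
        url=f"{vault_addr}/v1/{mount}/issue/{role}",
        data=body,
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical address, token, and role, for illustration only.
    req = build_issue_request(
        "https://vault.example.internal:8200", "s.example-token",
        role="internal-servers", common_name="app01.example.internal",
    )
    print(req.get_method(), req.full_url)
    # Sending it with urllib.request.urlopen(req) would return JSON whose "data"
    # object carries the certificate, private_key, and issuing_ca fields.
```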
» Backing up and maintaining availability
Any downtime in the Vault service would have a dramatic effect on our downstream clients, so we chose Consul, which acts as a highly available backend. This way our Vault instances remain completely stateless and can easily be brought up or down. Our Consul server cluster has 3 nodes in an Auto Scaling group, behind an Elastic Load Balancer.
Our Elastic Load Balancer health check is done on the leader route (/v1/status/leader) to make sure that the cluster remains healthy at all times and that there is no loss of quorum. Our entire Consul infrastructure is also described in a single CloudFormation template kept in version control.
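For a scripted version of the same idea, the Python sketch below inspects the body that Consul's /v1/status/leader endpoint returns: a JSON string holding the leader's address, or an empty string when no leader is elected. The function name and sample addresses are illustrative, and this goes one step further than a load balancer check, which typically looks only at the HTTP status code.

```python
import json


def has_leader(response_body: str) -> bool:
    """Return True when the /v1/status/leader body names a Raft leader."""
    try:
        leader = json.loads(response_body)
    except json.JSONDecodeError:
        # Anything that is not valid JSON is treated as an unhealthy answer.
        return False
    return isinstance(leader, str) and leader != ""


if __name__ == "__main__":
    print(has_leader('"10.0.1.12:8300"'))  # a leader address was returned
    print(has_leader('""'))                # empty string: no leader elected
```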
We realized we could leverage Consul to store all of our non-sensitive configurations as well, although we do not yet use its service discovery capabilities. Since Consul is our Vault storage backend, it became pretty evident that it was very important to back all that data up properly. We set up Consul backups: every hour, a script scheduled in a cron file runs the command consul snapshot save. This makes a snapshot with all the data. We stamp it with the node IP and the date and we send it to AWS S3. This way, if we need to, we can always rebuild our entire server cluster with very minimal loss of data. Just be aware that this command is relatively new; it was released a couple of months ago in Consul 0.7.1.
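The hourly backup can be sketched in Python as follows. The file naming scheme, the working directory, and the bucket name are assumptions for illustration; the two commands themselves (consul snapshot save and aws s3 cp) are standard CLI calls, and the sketch only prints them rather than running them.

```python
from datetime import datetime, timezone
from typing import List


def snapshot_filename(node_ip: str, taken_at: datetime) -> str:
    """Stamp the snapshot with the node IP and the time it was taken."""
    stamp = taken_at.strftime("%Y-%m-%dT%H-%M-%SZ")
    return f"consul-{node_ip}-{stamp}.snap"


def backup_commands(node_ip: str, taken_at: datetime, bucket: str,
                    workdir: str = "/tmp") -> List[List[str]]:
    """Return the two commands a cron job would run: save locally, then upload."""
    name = snapshot_filename(node_ip, taken_at)
    local_path = f"{workdir}/{name}"
    return [
        ["consul", "snapshot", "save", local_path],
        ["aws", "s3", "cp", local_path, f"s3://{bucket}/{name}"],
    ]


if __name__ == "__main__":
    # Hypothetical node IP and bucket name.
    for cmd in backup_commands("10.0.1.12", datetime.now(timezone.utc), "example-consul-backups"):
        print(" ".join(cmd))  # a real job could pass each list to subprocess.run(cmd, check=True)
```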
At Kickstarter, we are also big fans of communicating with Slack, and we built an Ops chatbot to allow any engineer to list, read, or write non-sensitive key-value pairs (feature flags, for example) in Consul. This way the process remains completely transparent, and any engineer can see and search what has been done to the configs. We're big fans of transparency, and adopting a ChatOps model helped the Ops team be more collaborative and open with the rest of the engineering team.
» Getting authentication right
Now that we had a stateless Vault infrastructure and a high-availability backend with Consul, our next step was to build all the tools that would allow containers and applications to authenticate with Vault and retrieve all their secrets. At Kickstarter we use AWS's Elastic Container Service (ECS) to deploy new code to containers, and we found that the Vault AppRole backend was the perfect authentication method for services to authenticate automatically with Vault. An AppRole represents a set of Vault policies and login constraints that must be met in order to receive a Vault token associated with those policies.
One constraint that can be set, for example, is a CIDR block that only allows requests from specific IP addresses. At the time we were also doing a whole lot of work with AWS Lambda. Lambda is a serverless compute service that runs code in response to specific events. In this case, the triggering event would be the creation of a service with a CloudFormation template.
You're going to ask me, "Does it all really work?"
Once we launch a CloudFormation resource, a lambda is triggered that will create a role in Vault associated with a very specific access list. The lambda will retrieve a role ID and a secret ID with a short TTL and store these on Amazon S3. We use the ECS task IAM role ID as an identifier to name our S3 directory. This way the container always knows where to look for its credentials to get a Vault token associated with all those constraints. Once a container boots up, it is then able to source its environment variables and make them available when the service starts.
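The exchange of a role ID and secret ID for a token can be sketched in Python. The payload and the `auth.client_token` field follow Vault's AppRole login endpoint (POST /v1/auth/approle/login); the function names and the sample values are made up for the example, and no request is actually sent.

```python
import json
from typing import Tuple


def approle_login_payload(role_id: str, secret_id: str) -> bytes:
    """Build the JSON body to POST to /v1/auth/approle/login."""
    return json.dumps({"role_id": role_id, "secret_id": secret_id}).encode("utf-8")


def token_from_login_response(body: str) -> Tuple[str, int]:
    """Extract the client token and its lease duration (seconds) from the login response."""
    auth = json.loads(body)["auth"]
    return auth["client_token"], auth["lease_duration"]


if __name__ == "__main__":
    # Trimmed, made-up response in the shape Vault returns on a successful login.
    sample = json.dumps({
        "auth": {
            "client_token": "s.example-token",
            "lease_duration": 1200,
            "policies": ["default", "payments-production"],
            "renewable": True,
        }
    })
    token, ttl = token_from_login_response(sample)
    print(f"token={token} ttl={ttl}s")
```

Keeping the secret ID's TTL short, as described above, limits how long the credentials stored in S3 remain usable.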
» How does the container get its secrets?
We needed to find a way for a container to have access to all its key-value pairs as soon as the service starts. To achieve this, we put all our logic into a Docker script. Once a container is spawned, all it knows is its environment and service, and it has access to its AWS metadata. A container can query its metadata and find out what its IAM role is. That lets it retrieve its secret ID and role ID from S3 and use those credentials to get a Vault token.
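The lookup step might be sketched like this in Python. It assumes the container has already fetched its task role credentials document (ECS serves one at the address built from AWS_CONTAINER_CREDENTIALS_RELATIVE_URI, and it includes a RoleArn field); the bucket name, the `approle.json` object name, and the function names are hypothetical, since the talk only says the role identifier names the S3 directory.

```python
import json


def role_name_from_credentials(credentials_json: str) -> str:
    """Pull the IAM role name out of the RoleArn in the task credentials document."""
    role_arn = json.loads(credentials_json)["RoleArn"]
    # An IAM role ARN ends in "role/<optional path>/<name>"; keep the final segment.
    return role_arn.rsplit("/", 1)[-1]


def approle_credentials_location(bucket: str, role_name: str) -> str:
    """Hypothetical layout: one S3 directory per role holding the AppRole credentials."""
    return f"s3://{bucket}/{role_name}/approle.json"


if __name__ == "__main__":
    sample = json.dumps({"RoleArn": "arn:aws:iam::123456789012:role/payments-production"})
    name = role_name_from_credentials(sample)
    print(approle_credentials_location("example-vault-bootstrap", name))
```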
We are also using another lightweight Unix tool — envconsul. Envconsul is a great tool that allows applications to be configured with environment variables.
We use envconsul to retrieve all of our key-value pairs from both Consul and Vault and write them to a file, and then we simply source this file inside the container environment. If you remember, we also have this global path set up. Envconsul makes it pretty easy to support this: any environment-specific value will always take precedence over the value set in the global path.
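The precedence rule is easy to show in Python. This is not how envconsul is implemented; it is only a sketch of the behavior described above, with made-up keys and values, that merges the two sets of pairs and renders a file a shell could source.

```python
import shlex
from typing import Dict


def merge_config(global_values: Dict[str, str], env_values: Dict[str, str]) -> Dict[str, str]:
    """Environment-specific values override anything set on the global path."""
    merged = dict(global_values)
    merged.update(env_values)
    return merged


def render_env_file(values: Dict[str, str]) -> str:
    """Render sorted, shell-quoted export lines suitable for sourcing."""
    return "".join(f"export {key}={shlex.quote(str(values[key]))}\n" for key in sorted(values))


if __name__ == "__main__":
    global_values = {"LOG_LEVEL": "info", "REGION": "us-east-1"}
    production_values = {"LOG_LEVEL": "warn", "GREETING": "hello world"}
    print(render_env_file(merge_config(global_values, production_values)), end="")
```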
There you have it—the container environment is populated with all your key-value pairs and made available to your service as soon as it starts.
» Monitoring key metrics
The last thing I wanted to touch on is how important metrics and monitoring are to us. They give us complete visibility of our entire infrastructure and insight into how our clusters and servers are doing at all times. In both Vault and Consul, we use the telemetry configuration option, which sends all the metrics to a local StatsD server. Those metrics are then picked up by the Telegraf agent, which sends them to the InfluxDB server. We then visualize all the metrics with Grafana. Telegraf, InfluxDB, and Grafana are all open source tools. We do love open source tools at Kickstarter.
This is an example of what our Vault Grafana dashboard looks like, where we check key metrics. We also wrote a few custom metrics, and we use them to check for sealed servers, whether everything is okay with the backend, whether there is a leader, and so on. We really enjoy working with Vault and Consul at Kickstarter, but I definitely must admit that the frequency of releases of new features keeps me on my toes!
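A custom metric such as "is this server sealed?" can be reported to a local StatsD listener with very little code. The Python sketch below uses the standard StatsD gauge line format (name:value|g) over UDP; the metric name, host, and port are assumptions, and the talk does not show how Kickstarter's own custom metrics are written.

```python
import socket


def format_gauge(name: str, value: int) -> str:
    """Format a StatsD gauge line: <name>:<value>|g."""
    return f"{name}:{value}|g"


def sealed_gauge(sealed: bool) -> str:
    """Report 1 when the Vault server is sealed and 0 otherwise (hypothetical metric name)."""
    return format_gauge("vault.custom.sealed", 1 if sealed else 0)


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget the metric to a StatsD listener over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    print(sealed_gauge(False))
```

A dashboard alert on this gauge being non-zero would then flag any instance waiting for a manual unseal.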
This was a broad overview on how we use Consul and Vault at Kickstarter. I also wanted to give a shout out to my amazing teammates, Kyle, Aaron, and Logan who are here today. If you have any questions, you can hit me up on Twitter at @DevOpsNatacha.