How Snowplow Automates Private SaaS Infrastructure Across AWS and GCP at Scale
Jul 09, 2020
In this case study, you'll learn how Snowplow analytics uses the entire HashiStack in their private, multi-cloud SaaS model to automate setup, upgrades, and monitoring for 150+ big data pipeline deployments.
Snowplow provides a fully managed service solution where each client has their own individual systems stack that is 100% isolated from every other client and in their own cloud. We need our large (and growing) estate of configurable data pipelines to be consistent, up to date, and secure. This talk will discuss how Snowplow leverages the HashiStack to automate the setup, upgrade, and monitoring of over 150 big data pipeline deployments across multiple clouds through our Private SaaS model.
*Josh Beemster: *
Today we're going to be talking about how Snowplow Analytics manages private SaaS infrastructure at scale, across 150+ Amazon and GCP clients.
My name is Josh Beemster, and I'm the technical operations lead at Snowplow Analytics.
*João Luis: *
Hi, I'm João Luis, site reliability engineer at Snowplow. I write Terraform code.
*Josh Beemster: *
Snowplow Analytics is an open-source data collection platform. Our product lets you collect really granular, high-quality event-level data from all of your websites, apps, server-side tracking—anything you want to track, you can track with Snowplow.
We're not going to get into all the amazing things you can do with Snowplow, but we'll talk about the fact that it's composed of lots and lots of different systems.
From an orchestration point of view, this is quite complicated. We've got a lot of things that we need to stitch together. From collection, through data validation and enrichment, storing that data, modeling that data, and then analyzing that data, we help organize all of that.
Defining Private SaaS
To complete the picture, we're going to talk as well about what private SaaS is, and how it looks at Snowplow. Private SaaS is like SaaS, but rather than having a multi-tenanted environment, every client has their own environment, every client has their own stack, everything unique, everything sandboxed.
That means that every client has different volume and different scaling requirements, they have different peaks, different troughs, different requirements entirely. Every single one of these 150 clients has something different. They've got different components and different combinations of components.
It matters what they're trying to solve, and it all varies. To add to the picture, all these environments can fail at the same time, due to things that are outside our control, like cloud failures.
You'd be right to think that we occasionally question why we do this to ourselves.
How does this link with the HashiStack? We had a lot of problems we were trying to solve. Lots of components that need to be configured together.
We've got everyone deployed in a private SaaS model, and that's not going away. It's a model we enjoy and one we choose to run as a company. We've got hundreds of environments and, hopefully at some point, thousands that we're going to need to manage, so this needs to scale up.
It needs to be secure. We need to support interactions with this infrastructure from lots of vantage points.
Snowplow and the HashiStack
It's not Terraform as you would write it in a traditional sense. It's very generic. The ways that we configure it are done in the Consul and Vault metadata layer, and then Terraform is really just a generic stack that we can apply in lots of different ways.
The only bit that we've custom-built is in the middle, what we call our "deployment service." This is a very thin authenticated API layer that lets us submit jobs to Nomad, to then trigger the deployments, to pull the data from Consul and Vault, and then to deploy out to our AWS and GCP cloud environments.
The 2 current implementations we have of that are our Slackbot, called "Snowdroid," and some integrations with Insights, which is our SaaS portal into all this infrastructure.
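As a rough sketch of what a thin layer like this might hand to Nomad, the Python below builds (but does not send) a request to Nomad's HTTP API for dispatching a parameterized job. The job name `terraform-deploy`, the metadata keys, and the Nomad address are hypothetical; the talk does not show the deployment service's real job definitions or how it authenticates callers.

```python
import json
from urllib.request import Request


def build_dispatch_request(nomad_addr: str, token: str, job_id: str, meta: dict) -> Request:
    """Build a POST to Nomad's dispatch endpoint for a parameterized job.

    Assumption: deployments run as a parameterized Nomad job whose Meta
    tells the Terraform wrapper which customer, environment, and stack to apply.
    """
    body = json.dumps({"Meta": meta}).encode("utf-8")
    return Request(
        url=f"{nomad_addr}/v1/job/{job_id}/dispatch",
        data=body,
        method="POST",
        headers={
            "X-Nomad-Token": token,  # Nomad ACL token header
            "Content-Type": "application/json",
        },
    )


# Hypothetical values, for illustration only.
request = build_dispatch_request(
    "http://nomad.example:4646",
    "secret-token",
    "terraform-deploy",
    {"customer": "aws_sandbox", "environment": "mini1", "stack": "aws_mini"},
)
```

A Slackbot command or a portal action could each end up calling a function like this, which is one way to keep both entry points on the same deployment path.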
Now João will show you how we deploy a Snowplow mini. João, over to you.
Stacks at Snowplow
*João Luis: *
Before that, we need to talk a bit about stacks, stack dependencies, topologies, workspaces, and Consul migrations. After that, we'll be ready to deploy the minis.
Stacks is a concept that we have created at Snowplow. It's an abstraction on top of Terraform code that allows us to build dynamic infrastructure.
The best way to understand a stack is to see how it looks in code. We have some screenshots of the stacks that we are going to use for this demo, the aws_setup and aws_mini.
For the aws_setup stack, we have 3 folders. One is the version, where we do the Terraform applies.
Then we have the migrations, and then the modules. The version_mod folder is also where the code that orchestrates the whole deployment lives, and the modules are used to organize our Terraform code.
On the right, you can see the content of the sub-modules. We like to organize it using the cloud service's name. You see, for example, the folder named "ec2." This is a Terraform module where we store AMIs, bastions, keys. We also have an IAM folder, where we store code that creates roles and policies.
For example, if we were using other AWS services like ECS, we would have a folder for ECS, and then we would have code to create the ECS cluster, the tasks, or the ECS services.
Here you see another stack, the AWS mini stack.
The first difference that we notice is that we have several versions. We create a new version every time that we need to make a compatibility-breaking change.
This is important, because it allows us to have some customers on one version, and other customers on another, and so on. We have hundreds of customers with minis, and we aren't able to upgrade them all at the same time.
Other than the versions, we also have a templates folder, where we store configs for Terraform. Then you have an upgrades folder, where we store scripts to do some upgrades.
You can see the stack dependencies for mini. This means that, to deploy the aws_mini 0.6.0, you need to deploy, first, aws_setup 0.4.0.
The mini stack deployment is quite a simple deployment, but if you look here, you can see an example of a more complicated one. The way to read this diagram is to start on the right and follow the arrows, so that we understand the dependency arc.
For example, if we wanted to deploy the aws_metrics_relay, we would have to follow the arrow into aws_rt_pipeline, and then we would follow the arrow to aws_setup.
We would have to do the Terraform apply for the aws_setup first, and then Terraform apply for the aws_rt_pipeline, and then Terraform apply for aws_metrics_relay.
We also have optional dependencies. For example, you can see here, in aws_iglu, that arrow is not continuous. So, if the customer chooses to have an AWS Iglu Server, we would have to deploy it first, before deploying the AWS pipeline.
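The ordering rule described above (apply a stack only after everything it depends on) is a topological sort. The sketch below uses Python's standard `graphlib` to illustrate it; the dependency tables are reconstructed from the stacks named in the talk, and the assumption that aws_iglu itself depends on aws_setup is ours, not something the talk states.

```python
from graphlib import TopologicalSorter

# Required dependencies: stack -> stacks that must be applied first.
REQUIRED = {
    "aws_setup": [],
    "aws_rt_pipeline": ["aws_setup"],
    "aws_metrics_relay": ["aws_rt_pipeline"],
    "aws_iglu": ["aws_setup"],  # assumption, not stated in the talk
}

# Optional dependencies: only honored when the customer chose that stack.
OPTIONAL = {
    "aws_rt_pipeline": ["aws_iglu"],
}


def apply_order(target, selected_optional=frozenset()):
    """Return the stacks to `terraform apply`, dependencies first."""
    graph = {}
    pending = [target]
    while pending:
        stack = pending.pop()
        if stack in graph:
            continue
        deps = list(REQUIRED.get(stack, []))
        deps += [d for d in OPTIONAL.get(stack, []) if d in selected_optional]
        graph[stack] = deps
        pending.extend(deps)
    return list(TopologicalSorter(graph).static_order())


plain_order = apply_order("aws_metrics_relay")
with_iglu = apply_order("aws_metrics_relay", {"aws_iglu"})
```

Without the optional stack, this yields aws_setup, then aws_rt_pipeline, then aws_metrics_relay. With aws_iglu selected, it is placed before aws_rt_pipeline.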
Here is the default deployment topology that we are going to use for the demo. We'll be deploying 3 minis into the same aws_setup. This is quite cool, because it allows us to reuse the infrastructure. We don't just reuse code; we also reuse the infrastructure. We only have to deploy the aws_setup once, and then all the minis use it.
Terraform workspaces are usually used to manage environments, like a production, staging, or development environment, but we also use them to manage the customer. So our workspaces are a combination of customer and environment.
This slide lists workspaces and customer names. On the right, we have the environment name. To understand how we work with these in Terraform code: We fetch the workspace name and split it by underscores. The customer name is everything up to the last underscore.
The environment name is after the last underscore, the remaining bit. This is important, because it's what allows us to connect into different AWS accounts.
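The splitting rule can be mirrored in a few lines of Python. This is only an illustration of the rule as described; the talk does this inside Terraform code, which is not shown, and the workspace name `aws_sandbox_mini1` is inferred from the demo's customer and environment names.

```python
from typing import NamedTuple


class Workspace(NamedTuple):
    customer: str
    environment: str


def parse_workspace(name: str) -> Workspace:
    """Split on the LAST underscore: customer names may contain underscores."""
    customer, separator, environment = name.rpartition("_")
    if not separator or not customer or not environment:
        raise ValueError(f"expected '<customer>_<environment>', got {name!r}")
    return Workspace(customer, environment)


parsed = parse_workspace("aws_sandbox_mini1")
```

Splitting on the last underscore is what lets a customer name such as aws_sandbox contain underscores. The cost is that environment names cannot contain one.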
On this slide, we can see Consul and the paths to the Consul key. It is using the information that we have in the workspaces: the customer name, the stack, and the environment.
The Terraform code, using the customer, the environment, and the stack name where it's running, builds a string that is used to go into Consul and fetch the configuration values. This is how we store the configuration for hundreds and, hopefully, thousands of customers.
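A minimal sketch of that string-building step, in Python rather than Terraform. The path layout `<customer>/<environment>/<stack>/<key>` is an assumption made for illustration; the talk names the three ingredients but not their order or any prefix.

```python
def consul_key(customer: str, environment: str, stack: str, key: str) -> str:
    """Build a Consul KV path from the pieces encoded in the workspace.

    Assumption: the layout is <customer>/<environment>/<stack>/<key>.
    """
    parts = [customer, environment, stack, key]
    for part in parts:
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return "/".join(parts)


path = consul_key("aws_sandbox", "mini1", "aws_mini", "env_name")
```

Because every path is derived from the same three values, one generic stack can serve any customer: only the workspace changes, never the code.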
As you might be guessing, that is a lot of Consul keys, and we aren't adding them all manually. We need a programmatic way to manage the keys in Consul. We call this "Consul migrations."
Consul migrations work very similarly to database migrations for normal SQL databases. You see here that the contents of the migrations folders of one stack are Python scripts.
Here's an example of a migration. Let's see what this migration does.
We are declaring a key name, EBS-encrypted. For the mini, we have to decide: should we encrypt it or not? The default value is 0. We have a text field from which we generate help documentation. Then we have some more metadata: the type of the key (in this case, an integer), the values it can take (0 and 1), and whether it's required or not.
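The fields described for this key map naturally onto a small declaration object. The sketch below is a guess at the shape, not Snowplow's migration format: the class name `KeySpec`, the spelling `ebs_encrypted`, the description text, and the `required=True` setting are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class KeySpec:
    name: str
    default: Any
    description: str  # source text for generated help documentation
    value_type: type
    allowed: Optional[Tuple[Any, ...]] = None
    required: bool = True

    def validate(self, value: Any) -> Any:
        """Reject values of the wrong type or outside the allowed set."""
        if not isinstance(value, self.value_type):
            raise TypeError(f"{self.name}: expected {self.value_type.__name__}")
        if self.allowed is not None and value not in self.allowed:
            raise ValueError(f"{self.name}: {value!r} not in {self.allowed}")
        return value

    def help_line(self) -> str:
        """One line of help text built from the declaration."""
        return f"{self.name} (default {self.default!r}): {self.description}"


# Hypothetical declaration mirroring the fields described in the talk.
EBS_ENCRYPTED = KeySpec(
    name="ebs_encrypted",
    default=0,
    description="Whether to encrypt the mini's EBS volume (0 = no, 1 = yes).",
    value_type=int,
    allowed=(0, 1),
    required=True,
)
```

Keeping the type and allowed values next to the default means the same declaration can drive validation, help text, and the initial write to Consul.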
3 Minis: A Demo
Let me recap what we are going to deploy in the demo: 3 minis into AWS, and we are going to reuse an environment that Josh created previously.
For the demo, I'm going to pull up the CLI. First we need to set up Consul. Let's go into the migrations folder, for the mini. Then, let's run the migration command, just for you to see what is required to run. So, we need the customer name, the environment, and the region where it's going to be set up.
Now, let's fetch the actual commands that run the migration. For that, we have 3 bash commands. Each one will set up a mini in the EU Central region. We also have the option --no-dryrun, which will create the keys in Consul.
It's adding, at this moment, probably 100 or 200 keys into Consul with all the configurations that are possible for aws_mini stack.
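The dry-run behavior can be sketched against an in-memory dictionary standing in for Consul. The function name, the path layout, and the rule that existing keys are never overwritten are assumptions, borrowed from how database migrations usually behave; the real migration scripts are not shown in the talk.

```python
def run_migration(store: dict, base_path: str, defaults: dict, dry_run: bool = True) -> dict:
    """Return the keys a migration would create; write them only if dry_run is False.

    `store` stands in for the Consul KV store.
    Assumption: keys that already exist are left untouched, so reruns are safe.
    """
    planned = {}
    for name, default in defaults.items():
        key = f"{base_path}/{name}"
        if key not in store:
            planned[key] = str(default)  # Consul stores values as bytes/strings
    if not dry_run:
        store.update(planned)
    return planned


kv = {}
defaults = {"ebs_encrypted": 0, "env_name": "default"}  # hypothetical defaults
preview = run_migration(kv, "aws_sandbox/mini1/aws_mini", defaults)
keys_after_preview = len(kv)
created = run_migration(kv, "aws_sandbox/mini1/aws_mini", defaults, dry_run=False)
rerun = run_migration(kv, "aws_sandbox/mini1/aws_mini", defaults, dry_run=False)
```

Defaulting to a dry run means a mistyped customer or environment name shows up in the preview before anything is written.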
Now let's jump into Consul, to see that the 3 minis inside of aws_sandbox were created. Now what we need to do is just change one default value, the environment name. This configuration is going to reuse Josh's aws_setup.
We change it to "josh," because that's the environment name for the aws_setup. OK, mini1 is done. Now we need to do the remaining minis.
Let me put the value on the browser line. Let's go into mini2. We do the environment name, "josh," and save. Now let's go into mini3. Setting also the env_name. Let's input the value.
Now we have all of the configurations done to deploy the 3 minis. No more are needed because the default values that we have work well enough for the demo.
This stack also uses secrets. We store the secrets in Vault.
At this moment, Vault is also set up. What we need to do now is to run Terraform, and we will have our brand-new minis, ready to be used.
Let's go into the version of the mini, 0.6.0. We need to do terraform init. It's initializing all of the sub-modules, connecting to the backend, fetching the providers.
Now we need to create workspaces, using aws_sandbox. If this was a real customer, the customer name would be used.
Finally, we need to do the Terraform applies. There's this trick that I usually do to do multiple applies. It's just a bash command, with all of the commands.
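The exact command used in the demo is not shown, but the idea of chaining one apply per workspace might look like the sketch below, which only builds the command strings. The workspace names follow the customer_environment convention described earlier, and `-auto-approve` is left off by default so each apply would still prompt for confirmation.

```python
import shlex


def apply_commands(customer: str, environments: list, auto_approve: bool = False) -> list:
    """Build one 'select workspace, then apply' command per environment."""
    apply = "terraform apply" + (" -auto-approve" if auto_approve else "")
    commands = []
    for environment in environments:
        workspace = shlex.quote(f"{customer}_{environment}")
        commands.append(f"terraform workspace select {workspace} && {apply}")
    return commands


commands = apply_commands("aws_sandbox", ["mini1", "mini2", "mini3"])
one_liner = " && ".join(commands)  # a single line to paste into the shell
```

Joining with `&&` stops the chain at the first failed apply, which is usually what you want when later workspaces share infrastructure with earlier ones.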
It's starting to run the Terraform apply.
This part is going to take a bit, so let's fast forward to the end, after Terraform has applied.
The minis have ended the deployment, all went well, and we have the URL for one of the minis. Let's test it out, see if it worked OK. Let's input it in the browser. OK, cool. That's good.
That was the demo. Now Josh will show you how to avoid doing all of these manual steps that I've been doing to set up the 3 minis.
*Josh Beemster: *
What we've shown there is a very classic way of deploying Terraform. You're in the CLI, you're running Terraform apply, going through all those stages to initialize, stamp things out, upgrade, apply, all of those kinds of steps.
What I want to talk about now is how we do more mass deployments, with a lot more automation on top.
What we're going to demo today is upgrading all the minis that João just deployed to a new version. We're going to be doing that on top of Nomad. We're going to be interrogating Nomad logs and visualizing how the deployments are going in Grafana.
Mass deployments are very scary. You've got potential incompatibilities. You've got to know whether the current state of the environment is correct or if it drifted from what you expect it to be. Is the upgrade safe to apply?
Then there's security. With a mass deploy, you can do a lot of damage.
João's example was one at a time, quite safe and human-driven. Here, we're talking about robots deploying lots and lots of changes that might not be easy to stop.
To be clear, we haven't solved all of these problems perfectly. It's very much a work in progress, but we've got a good base now. And we're starting to build things out, like drift detection, like automatic upgrades, that we hope to make safer and safer, over time.
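One common building block for drift detection, offered here as an illustration rather than as Snowplow's approach, is `terraform plan -detailed-exitcode`. With that flag the exit code is 0 when there is nothing to change, 2 when the plan contains changes, and 1 on error. The state labels returned below are our own naming.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a state."""
    if code == 0:
        return "in_sync"  # plan succeeded, no changes
    if code == 2:
        return "drifted"  # plan succeeded, changes are pending
    return "error"        # plan itself failed


def check_drift(workdir: str) -> str:
    """Run the plan in an already-initialized Terraform directory.

    Requires the terraform CLI on PATH; not executed in this example.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

Note that a non-empty plan does not distinguish real drift from configuration changes that simply haven't been applied yet, so a check like this answers "is it safe to assume nothing will change?" rather than "who changed it?".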
We can see here the 3 minis that João deployed. They're all reporting healthy. We can see this by the healthy host count.
There's one of each, which is exactly what we expect.
Very low CPU utilization. It's not doing anything at the moment. These are kind of just dead, waiting servers.
We can see about 12 2xx requests per interval. This is just the normal CloudWatch health checks that we've got configured, to check that everything is alive. So, it's fairly aggressive health tracking, but we need to make sure that everything is looking OK.
Now we're going to upgrade these minis. We're going to switch over to Slack. Here we've got a special channel just for Snowdroid. What we've got here is a transaction block, where we're going to submit 3 commands to deploy aws_mini_upgrade for mini1, mini2, and mini3.
When we hit submit on this, all of those jobs are going to land in Nomad. We can follow the links that have been provided to go directly to that Nomad job. But, for the sake of this demo, I'm just going to switch straight over to something I prepared earlier, which is the Nomad page.
Here, we can see that all of them are running, they're all currently green, so everything's kicked off correctly.
We can now go interrogate the logs of the Terraform deployment. We can do that for all the jobs. We're just going to quickly have a look. By scrolling up to the heads, we can see what is being applied. We did some upgrades.
We can see all the Terraform modules being initialized, just like they were on João's side. And, we can see that things are now starting to apply.
One quick thing with these is that it is a Terraform auto apply. We have a dry run mode of this as well. For this case, we've just gone straight for it, and started applying them.
As with João's demo, though, this can take a while, about 5-6 minutes for it to apply, so we're going to jump forward in time, and have a look at the finished product.
We can now see that all the AWS mini deployments are dead, they're finished. They all finished green, so we know that there's a zero exit code, which is exactly what we're looking for.
We can then go in and interrogate the logs again. And there we go.
So the deployment has finished successfully. We can see the aws_sandbox mini2 endpoint. We can go through all the jobs that we submitted and have a look at the logs for all of them. All 3 have been deployed and upgraded successfully.
Now that we know that they've been submitted successfully, we want to have a look at Grafana, because that's where we're monitoring these things, where we're looking for the success of this upgrade.
What we can see is a gap in the monitoring. When we hit this deployment, what happened first is that 2xx requests went to nothing, because we shut down the server to replace it with a new one. That took about 2 minutes to come back online.
We've then got the CPU utilization. That also dropped to the floor, but came back to life.
Then we've got the healthy host count. That took a lot longer to come back to life.
The reason for that is just that our health checks are much lengthier than the time it takes to deploy a new server, to make sure that it launches correctly. So we don't want to be too over-eager. We don't want to be too lazy about it, either.
Everything returned back to health, so we've got a successful deployment of 3 minis, in one Snowdroid command.
That's where we are now. We manage to do these deployments, not just 3 minis at a time, but often 100 at a time. We're doing a mass deployment rollout, where we're then monitoring the health checks, the Opsgenie alarms, and Grafana dashboards, to make sure that everything is working successfully.
This does have some risks, so depending on what service we're upgrading, we'll do different amounts of mass deployment.
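One way to vary the "amount" of a mass deployment is to roll out in batches and stop as soon as a batch fails its health checks. The sketch below is a generic illustration with hypothetical `deploy` and `healthy` callbacks; it is not a description of how Snowdroid or the deployment service actually schedules work.

```python
from typing import Callable, List, Tuple


def rollout(
    workspaces: List[str],
    batch_size: int,
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Deploy in batches; halt after the first batch with an unhealthy workspace.

    Returns (upgraded, failed). Workspaces in later batches are never touched
    once a failure is seen.
    """
    upgraded: List[str] = []
    for start in range(0, len(workspaces), batch_size):
        batch = workspaces[start:start + batch_size]
        for workspace in batch:
            deploy(workspace)
        failed = [w for w in batch if not healthy(w)]
        upgraded.extend(w for w in batch if w not in failed)
        if failed:
            return upgraded, failed
    return upgraded, []


# Simulated run: the third workspace comes back unhealthy.
touched: List[str] = []
upgraded, failed = rollout(
    ["c1", "c2", "c3", "c4", "c5"],
    batch_size=2,
    deploy=touched.append,
    healthy=lambda workspace: workspace != "c3",
)
```

A riskier service would get a smaller `batch_size`, which limits how many environments can be affected before a bad upgrade is noticed.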
But where do we want to take it next? One of the struggles is needing to put automation on top of automation. Terraform has modules, which then bring resources together. And we're now at a level above. We've got stacks that we need to start combining together, and we've found no natural way to do that.
We're looking at building a tool that we're calling internally "Snowtopia," which is going to be that level on top.
How do we combine multiple stacks in a logical topology to then deploy lots of these systems together?
An aws_setup for 3 minis, we could script that. We could build that into an actual deployment, a recipe, for lack of a better word.
This also opens up the chances to do easier multi-cloud deployment, using everything cloud native, but deciding to stitch all of these disparate systems together into one big multi-cloud pipeline deployment.
The most exciting thing for me, and for the team, I hope, will be self-service private SaaS. How do we let our clients manage their infrastructure in a private SaaS model without myself or the team having to execute anything?
How do we let them manage their own minis, their own pipelines, their own infrastructure, but Snowplow-branded? That's the future.
Thanks for listening.
*João Luis: *
Thanks for listening, and see you next time. Bye.