Learn how Snowflake's security and engineering teams package modules to enforce secure defaults and good RBAC design.
Hi. I am Prashant Kommini.
And I'm Andre Fedorov.
Our talk is about packaging security into HashiCorp Terraform modules. We are part of the platform automation and tooling team, which is part of the global security engineering team at Snowflake. We think a lot about how we can bridge the teams of security engineering, software engineering, and platform engineering. And so we think a lot about how we can optimize our DevOps and move towards DevSecOps.
When I joined Snowflake a little over four years ago, I joined an engineering team inside a security team inside an engineering-run organization. Things have shifted around since, but the nature of that relationship is still the same and will probably remain the same forever, because Snowflake is an engineering company. The value we bring to our customers is engineering data: engineering data processes, engineering data governance, engineering data things in general. And when your value comes from engineering, you have to surround even key functions like security with engineering interests. So that's what this talk is also about — how security done in a way that aligns with engineering processes is significantly better — through a couple of stories and through slides.
Today in this presentation we will cover infrastructure as code — the IaC security challenge, or security's IaC challenge, depending on how you look at it. We'll also present some ideas on how to address this challenge. Finally, we'll go through a case study of an internal tool we implemented called GEFF, the Generic External Function Framework. I'll cover the conceptual learnings from our experience deploying GEFF multiple times across various iterations, and Andre will cover the institutional experiences from that evolution.
I really like “institutional experiences.” That makes it sound so good, because this is the first institutional experience [shows slide of McLaren F1 sports car]. I don't know if anyone recognized this, but this is Elon Musk's first fast car after selling PayPal — one of his first purchases was this car. And his friend, PayPal co-founder Peter Thiel, injected a dangerous idea into this exciting car. He said, "Show me what this thing can do." And Elon Musk, being who he is, showed him what it can do. Over the first hill he showed him how it can really fly like a saucer and turn upside down and crash into a tree.
That's kind of how the first deployment of the GEFF idea was. Very much because having just accepted a kind of a handshake offer to be a manager, I thought I should make slides. These are the slides I made.
These explain how an engineer would think to explain to other engineers, using slides, a lot of very exciting ideas. In short, it's like “you can run SQL, it calls all sorts of APIs, and the APIs respond, and it's all distributed in Webscale, comes back, and all the responses from the APIs are all in your SQL.” And that was so exciting that I talked, and I talked at lunch talks, I talked to my team, I talked to the security teams around Snowflake, and people listened. Some of the people that I wanted to listen, listened, and they kind of put this on some pretty high-profile projects. That'll be story number two.
But story number one: some people listened and just ran with it on their own — took these ideas, maybe looked at the code that I wrote and borrowed some of it, but did duplicate threat models (most frustratingly, in the end). And — a much longer story made short — they landed us in a place where we became a team asking, “Well, we've got these multiple deployments with multiple threat models, and now we have security findings, and we have to reconcile the threat models and fix these things.”
But these teams, they're race car teams. They are done, they have presented, they've built a ton of stuff on top of this. Fixing security findings is not their most important priority. In one case, they hadn't even closed the pull request that they deployed to close the security finding, and they landed the Snowflake security community in a place where we're wondering whose responsibility it is to make sure that the pull request is closed after a security finding is addressed.
That's the kind of thing that I hope by the end of this talk we’ll teach you how to avoid forevermore, at least in larger institutions.
What we didn't have at that point was a way to package our ideas. We had our ideas as slides, but we weren't able to package them as code and share that, so teams could avoid misinterpreting our ideas into bootleg implementations. That gets us to the challenge with IaC: it's too fast. We are able to just use someone else's code. Mature organizations package up their patterns and publish them in registries or just GitHub repos, and we are able to import them, run with them, and reduce our deployment windows from days or even weeks to a matter of hours or sometimes even minutes.
While deploying in iterations and over time gives us time to think about the security and refine the security before our production deployments, with this sort of an expedited window of deployments we are unable to think through the security, and we are not able to engineer the security over time.
Moving fast leads to failing fast. That is a good thing in software engineering, but failing fast in security engineering leads to being vulnerable fast and being compromised faster. There's only so many times we can blame an intern before being held accountable.
What does security engineering really mean in the context of the software development lifecycle? Where does it actually occur? It occurs in every phase of the SDLC, but it most heavily occurs in the system design phase, in the form of threat modeling. Threat modeling breaks down into these four steps of diagramming our data flows between the various components of our system, identifying the threats that apply to these data flows, also implementing the security controls that mitigate these threats, and finally validating that these security controls are implemented right and actually work and are effective in mitigating the risks associated.
As you can imagine, the process breaks down cleanly into these steps on paper, but the tactical reality of implementing a threat model in an organization can be a lot more nonlinear. It requires all of these various teams to work and play well together, and that can cause friction if not done right, or if the technologies don't support the people and processes. Now that we understand what threat modeling is, what the process breaks down into, and which teams have to work well together, we are in a better position to appreciate why security engineering is slow.
Various teams across the organization implement the same patterns again and again. They threat-model these patterns. They have to work with the security teams. The security engineering teams take some time to see that these patterns are repeated — they're duplicated patterns — and they want to recall the existing threat model they previously did and take the learnings from it for this new deployment. So it takes a couple of cycles, where this could all be avoided if we were able to tie a threat model to a module or a piece of infrastructure that we've already implemented. That way we are able to build on top of infrastructure that we've already built.
There are a lot of threat models that are duplicated, and this is one reason for the delays in security. One other delay is that, especially with distributed systems that span multiple zones of control, you have to work with more than one security team to threat-model components of your distributed system. Especially when these security teams are distributed across the globe and across various time zones, this can add to the delays. Finally, security engineering is iterative, meaning that the teams that diagram the data flows are different from the teams that identify the threats. And that is different from the teams that actually implement the security controls and, finally, also different from the teams that validate the security controls actually work.
So you have the cloud engineering teams or just software engineering teams that diagram the data flows. You have product security or security engineering teams that identify the risks and call out the security controls that must be implemented. You have yet different teams — cloud engineering, infrastructure engineering, platform engineering, whatever you call it — that actually implement the security controls. And finally, you have compliance — red team, pink team, purple team, what have you — that validate the security controls. All of these teams have to work well together, and they add checks and balances to this process of engineering security into our deployments. But when this many teams have to work well together, it can be a little messier than the pretty process on paper.
That's kind of how it feels like, a little more messy. This slide is 1812 [shows slide of ragged soldiers trudging through snow], and this is the victorious army that invaded Russia, accomplished their goal of burning down Moscow — or I don't know if that was their goal to start, but they are a victorious army. That's kind of how the second project, Project Napoleon, felt.
If you look closely, two of those are interns. One of them is deciding to apply to law school — she doesn't want to be a project manager after all; she wants to be a lawyer. And the other is a little starstruck still, because he just finished freshman year and he's sitting in meetings where very senior engineers are arguing, "Is it my job to draw out the data flow diagram or is it your job? And who's making a commitment to whom, and who's deciding whether the commitments are good enough, and who in the end is going to be criticized in the eventuality that one of those security teams comes along and pokes some holes in this infrastructure?"
While the project was a success — it was a high-visibility project, so we kind of got the resources we wanted — there were times when it was a lot less of a success, at least in the minds of the people engaging in it, even people pretty high up. At one point, my boss's boss's boss was pretty mad that his boss had somehow misunderstood the timeline on which this was happening, and that it was all happening too slowly, and that he heard it was a mess. And we, the people on the ground trying to implement this correctly, were feeling like the folks on the slide — a lot less like the victorious people who in the end did ship an impressive piece of software in a tiger team across organizational boundaries.
Again, this in hindsight seems a little sad because it's all avoidable, and it's really something that in hindsight is almost trivially avoidable. But let's not worry about that and go onto the actual restatement of the problem and solution.
In this experience, this was the second deployment of the same tool — a second iteration. So we were able to package all of the infrastructure in Terraform. Now, instead of sharing it as slides, we have a module, and we were able to share this module and deploy it in multiple instances in a repeatable way. But the security of that is still in question because of distributed ownership. We all feel a little worn out at the end of the process, because who implements the security controls is different from who validates them. If all of this were somehow packaged into the module, we could at least avoid the repeated threat models and the repeated implementations of security controls that have already been built.
Refocusing on the problem: infrastructure as code — or DevOps powered by infrastructure as code — has evolved across the last decade, especially in the last five years, to become really efficient. But the security engineering aspect hasn't really evolved much over this decade. Threat modeling in particular has remained a key practice that still needs refinement to become more automatable and better integrated into the DevSecOps life cycle. While security engineering done right is slow, we can improve it and integrate it better into our DevOps workflows.
We've been in more than 20 threat models that repeat the same patterns across even a few months where various teams are working in isolation and vacuum and they don't know about each other's work. If they knew about it, they could have an expedited threat model and also expedited deployment, because they can totally avoid that threat model because components of their system are shared.
Being part of all of these threat models, we saw three key aspects of security repeat in every threat model. If we can somehow package these up into our Terraform modules, we can actually expedite our threat model, have more meaningful conversations that start from having a basic level of security and evolve to actually doing much more complex things with our deployments. What we saw is packaging up observability, having a basic level of security that we package into our modules, and packaging role-based access control can make for more effective threat models and consequently speed up our deployments.
But how do we package observability into our Terraform? The first thing we can do is enable all of our logs — DNS query logs within our VPC — to signal that we want to know exactly what domains are being queried from our systems. We also want to know what incoming and outgoing connections are happening within our services. We also want clear visibility into what, who, from where, and when users and roles are accessing our clusters and resources. So we want very clear access logs, we want a consistent access log strategy across the organization, and we want all of these to provide visibility into our VPC.
Next time someone is implementing a VPC, they shouldn't have to miss basic configurations such as these. Packaging them up into a VPC module and sharing that into our registry and tying a threat model to that VPC — next time someone repeats that and they're implementing a VPC, they just use this module and they get an expedited threat model.
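A rough sketch of what such a VPC observability module might look like. The resource layout and variable names here are illustrative, not Snowflake's actual module:

```hcl
# Illustrative "secure VPC" module fragment: flow logs and DNS query logs
# are on by default, so consumers can't forget them.

variable "vpc_id"         { type = string }
variable "name"           { type = string }
variable "log_bucket_arn" { type = string }

# VPC flow logs: record accepted and rejected connections in the VPC.
resource "aws_flow_log" "vpc" {
  vpc_id               = var.vpc_id
  traffic_type         = "ALL"
  log_destination      = var.log_bucket_arn
  log_destination_type = "s3"
}

# Route 53 Resolver query logging: know exactly which domains are
# being resolved from inside the VPC.
resource "aws_route53_resolver_query_log_config" "dns" {
  name            = "${var.name}-dns-query-logs"
  destination_arn = var.log_bucket_arn
}

resource "aws_route53_resolver_query_log_config_association" "dns" {
  resolver_query_log_config_id = aws_route53_resolver_query_log_config.dns.id
  resource_id                  = var.vpc_id
}
```

Because the logging resources live inside the module rather than beside it, anyone who instantiates the module inherits the observability baseline automatically.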
So, we want all of our metrics collected from our EKS and ECS clusters, and any errors need to be tracked, so we want error-tracking systems — Sentry, Prometheus, what have you — packaged into our clusters. We want all of these tools, either open source or enterprise, packaged into our modules, and we want teams to be incentivized to use those modules. That will be our secure default. Anyone overriding those defaults, we want them to actually have a threat model where we discuss their ears off about those configurations.
What all of this gets us is attack provenance. Let's say an attacker was able to gain initial access to a low-value target. We may not have all of this visibility into low-value targets, but let's say from the low-value target they were able to make their way up to the higher-value targets shown lower in this hierarchy. Because of the high visibility of higher-value targets, we are able to immediately detect this compromise and quickly respond and start our remediation efforts. But we need to backtrack this entire attack path all the way up to the source. Towards that end, we need every node in this attack path to actually have all of its logs enabled for us to actually establish this reverse attack path or attack provenance. Packaging observability, this is the benefit we get. And the way to do so is to have all of these basic configurations at your VPC level, at your cluster level, and all of the resources packaged up into our Terraform modules and incentivizing usage of those Terraform modules.
The second thing we want to package is a basic level of security. We can do this by not using the resources themselves but using modules that wrap these resources with security configurations: block stores that are nonpublic, security groups that deny by default, and all of our data resources — databases and volumes — kept behind the VPN.
Also, we want the simple toggle switches that enable KMS encryption turned on, so that it's hard to miss or quietly disable encryption of our KMS secrets in our EKS cluster — or on other clouds as well. And we want our private traffic confined to our private VPCs through VPC endpoints and split DNS. What all of this gets us is that the conversations in our threat model can start after the basic level of security is already in place, rather than spending time on the basics that must be part of every secure deployment. This gets us secure defaults in our deployments.
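As a concrete sketch of the wrapping idea: instead of handing teams a raw bucket resource, the module below bakes in the nonpublic and encrypted-by-default posture. All names and variables are illustrative, not our actual module:

```hcl
# Illustrative "secure bucket" wrapper module: callers get nonpublic
# storage, KMS encryption, and a deny-by-default security group for free.

variable "bucket_name" { type = string }
variable "kms_key_arn" { type = string }
variable "vpc_id"      { type = string }

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

# Block every flavor of public access by default.
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# KMS encryption is simply on, not a toggle someone has to remember.
resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.kms_key_arn
    }
  }
}

# No ingress or egress rules declared: a Terraform-managed security group
# with no rules denies all traffic until a rule is explicitly added.
resource "aws_security_group" "deny_by_default" {
  name   = "${var.bucket_name}-deny-by-default"
  vpc_id = var.vpc_id
}
```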
The final thing we want to package into our Terraform modules is role-based access control. How can we do this? We have this Terraform block, and within that we have required providers. This is a commonly used feature because we have to use a couple of providers in order to create some of our resources. Many folks use it to have region-specific providers. Many folks use it to maybe have multiple providers. What we were able to do is to also use these configuration aliases to specifically enforce the use of roles. So we are able to have much more granular roles and have role-based providers and enforce that these are the aliases that should be injected into our module to be used securely. This way, someone who sees our module can immediately see this file and know that “towards deploying this module in my environment, I'm going to need this level of access.” And these are the resources that are created with this role.
As you can see, within the Snowflake provider we have five roles being used here, creating various account-level, admin-level, database-level, and schema-level objects. Similarly for the AWS provider, we have the same thing: an admin role that creates IAM (Identity and Access Management) resources, security groups, things like that, plus team-specific roles. We are the platform automation and tooling team, so that is one role; we support the incident response and threat detection teams, and those are the other roles our AWS accounts have.
Usage of the configuration alias is one way to package the roles that we need or the roles that must be used with our module. The other thing we can do is to use granular roles in the providers — that is, when instantiating the module, we need these injected. We'll see a little bit more about this. That way, all access to that resource is controlled through that role.
We referred to GEFF in passing, but this is really what it looks like on an architectural level. You have Snowflake resources being created, you have AWS resources that are created, and all of these need to talk to each other. We were able to deploy this cross-cloud service in a repeatable and secure way. We were able to package role-based access control into this Terraform module, and we were able to share this across our organization for deployments in a way that doesn't cause us to be worn out.
As you can see in this architecture, you have the API integration on the left. That has a role tied to it on the Snowflake side. That role uses chaining of roles and assumes a role on the AWS side of the architecture. Then we are able to invoke Lambda functions and use them with the GEFF API framework. The GEFF API essentially allows all of the Snowflake external functions to speak a consistent language, interact with external endpoints, and extract and load that data into Snowflake.
I don't want to focus too much on the GEFF aspect but more on the learnings that the installation of GEFF across time gave us. I just wanted you to know enough about this tool that we are talking about to understand how our first instinct in this architecture was to just use admin roles on both sides to create all of these resources. But across time what we found was that the proliferation of access to these resources became uncontrollable and that we had to chase down all of this access and actually revoke a lot of accesses. And that took a lot of hours to actually limit and actually “unproliferate,” if that's a word.
Towards that end, what we were able to do is split the usage of this admin role into more granular roles and then enforce these through the use of configuration aliases. With that split, our module now looked like this, where we have two different clouds. Within each cloud, we have two or more roles, and within each role we have a limited set of resources created. Access to these resources is controlled through that role. The same thing on the AWS side. We are able to make our roles much more granular and limit the proliferation of access to those resources.
Now that we've packaged and enforced our RBAC through our module, how do users actually use it? They create these providers at instantiation time and inject them into the providers block of the module. You can see how we have created the API integration role and the storage integration role and passed them into the providers of the GEFF module instantiation. Within the module, we have references to these roles; we tie each role to the corresponding resources whose access it controls. The API integration role creates and owns the API integration resource, and anyone who needs access to API integrations across the organization can be granted this role. Same thing with the storage resources. The folks that work with API integrations are separate from the folks that work with storage integrations, and their roles should reflect that.
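At instantiation time, that injection might look something like this. The role names and module source are illustrative:

```hcl
# Role-scoped providers, created by the caller at instantiation time.
provider "snowflake" {
  alias = "api_integration_role"
  role  = "API_INTEGRATION_ROLE" # illustrative role name
}

provider "snowflake" {
  alias = "storage_integration_role"
  role  = "STORAGE_INTEGRATION_ROLE"
}

# Inject the role-scoped providers into the module's required aliases.
module "geff" {
  source = "./modules/geff" # illustrative path

  providers = {
    snowflake.api_integration_role     = snowflake.api_integration_role
    snowflake.storage_integration_role = snowflake.storage_integration_role
  }
}
```

Each resource inside the module is then created and owned by exactly the role passed in for it, so access never silently accumulates under a shared admin role.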
So we were able to package our RBAC into GEFF across the first two iterations of deploying it, but we were worn out in the process because of distributed cross-team ownership and friction over who has to implement the security controls and who has to validate them. Now, with Terraform, we are able to package all of this into a Terraform module and share it in the registry, and folks are able to just import the module and deploy it.
This is what it feels like now. Specifically, this is how the last deployment of GEFF went, because in my mind I had some ideas as to how I wanted to deploy it. I brought it up in a one-on-one with someone who I ultimately hoped would deploy GEFF. And he said, “Oh, that? I deployed that a couple of weeks ago.” And I go, “Wait, for what task or what project? Just for fun?” He's like, “Oh, no, no, we had a PR. It was the first one that we finished this quarter, actually, because it was so quick.” And I was like, "Well, how did you do it? Where was the threat model? Why was nobody involved? The C-suite didn't find out about it." And he goes, “Oh, I just took the module that Prashant wrote and then I read it. It was interesting. But then I deployed it.”
That's kind of how it should be, right? Engineers should be working together. Maybe there's a manager within earshot but not necessarily listening. And that's where I want to be as a manager. I want to be able to tell people about work that's already done rather than the other ways of doing it — the race car way or the Napoleon way. And the shoreline really embodies, I think, the turbulence of the ocean that is ever-present in corporations and the world at large — the whirlwinds and everything else that suck you into the day-to-day business. It's possible to not engage with that — to operate with knowledge of it, but separately from it.
That's kind of what we learned: we could use Terraform just as an infrastructure tool — write things down so that we get all of the code review goodness. But where it really shines institutionally — where it really aligns with the software engineering way of doing things, with SRE (site reliability engineering) workflows — is when you treat it as a distribution mechanism for modules, even internally, even when people are working in the same code base. Treat it as a way to create artifacts that are secured once, so that in the iterative process of folks reusing code, reevaluating code, and finding new threats or subtle ways things can be improved, you can just give people a new version if it's distributed as a Terraform module. They know exactly what to do with it; they update to the new version. They might be curious to know how you improved the threat model, but ideally they don't even worry about it as much as security engineers would perhaps like everyone to worry about.
So that's kind of what we recommend to you. That's kind of my hope for everybody here, that you can avoid communicating using slides or at least communicating ideas using slides, even though we're communicating ideas using slides now. Try to communicate ideas when you can as code. IaC might mean infrastructure as code, but it could very well mean ideas as code. Because we should be able to bake our ideas into Terraform. Even though it might not be the perfect language now, I think in the future it will be (something that we can talk about separately if you're interested). It's definitely something where the processes of engineers, the processes of both software and infrastructure engineers, will align significantly better with your ideas if you do.
I am not a connoisseur of art or anything, but what I get from that is frictionless coexistence. That is how teams should coexist. We have a hard time working together across teams as it is without processes and technologies adding to that friction; we want our processes and technologies to assist and reduce friction within our teams. That is what Terraform helped us with. What used to be a four-step process of various teams being involved in diagramming, identifying the threats, mitigating the risks, and re-implementing the same things again and again was reduced to a single review of an existing threat model, a review of the Terraform module tied to that threat model, and then just reusing that module.
To leave you with a solution to the IaC security challenge, or security's IaC challenge: IaC itself is the solution for the speed of security. Meaning, we don't want to look at deployments in isolation or in a vacuum. We want to look at all similar deployments as a single pattern, threat-model that pattern, package it up with the security baselines — with observability and with role-based access control — and build, across time, a registry of golden modules and share that. Anyone who uses a golden module is incentivized with an expedited threat model and an expedited deployment.
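Once that registry exists, consuming a golden module is a one-liner plus a version pin. The registry path and module name below are hypothetical:

```hcl
# Consuming a threat-modeled "golden" module from a private registry.
module "vpc" {
  source  = "app.terraform.io/example-org/secure-vpc/aws" # hypothetical path
  version = "~> 2.1" # pinned: a new version ships an updated threat model

  name = "detections-vpc"
}
```

The version pin is what makes the security iteration loop work: when the module owners fix a finding, consumers pick it up by bumping one line.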
That way, we communicate ideas as code — or infrastructure as code — as opposed to slides, and we shift the usage of Terraform left, much earlier in the SDLC. We want to package patterns into Terraform — threat-modeled, secure patterns. One last thing we want to leave you all with: we think of Terraform as being used mostly in deployment, adding value in the deployment phase of the SDLC. But we have actually extracted a lot more value out of Terraform by using it at the system design phase of the SDLC. We are able to shift the usage of Terraform left to also add value to security, expedite threat models, and have more meaningful conversations in those threat models — conversations that start from the baseline of a basic level of security.
Again, we are the Snowflake security team. We work in global security engineering. We solve problems related to data management, data access, data governance, data processing. And we try to use all of these to improve the detection and response pipeline in the security team at Snowflake. So any ideas that overlap, we're happy to discuss them further if you bump into us.
Specifically, the semantics of security being represented in Terraform is something I'd love to invite folks to talk about. If you have ideas here, I think the future of security is significantly faster than it is now. We mentioned earlier that security hasn't quite caught up, and I really believe that. And Snowflake works very hard on a lot of data engineering problems. Expressing the semantics about data and semantics about code inside of Terraform I think is one of the most exciting things that can happen in the next…soon.