Enterprise security at Palantir with the HashiCorp stack
In their talk at HashiConf 2017, Palantir's Elliot Graebert and Vlad Ungureanu review DevOps best practices, and discuss how to implement them without introducing excessive friction.
DevOps can be a tradeoff between security and ease of use. Few companies take a practical approach to security: They end up over-focusing on compliance, instead of effective defense and detection.
While it's easy to install HashiCorp Nomad, Vault and Consul, there's a large difference between installing tools and securing them.
In their talk, Elliot and Vlad review industry-wide best practices, and discuss how to implement them without introducing excessive friction. Topics include:
- Best practices for securely deploying infrastructure: encryption, separation of access, MFA, DR, Approval based deployment, logging and telemetry
- Security enforced by practice is inferior to enforcement through automation
- AWS design: subnets, routing tables, ELBs, SG’s
- Encryption at rest and in transit
- Encrypted AMI’s with Packer in a multi account environment
- AWS ACM, Fabio, and ipsec mesh
- Separation of access
- Environment isolation with some shared resources (DMS hub-spoke)
- Private vs. public subnets and administration with bastions
- MFA based SSH/admin workflows: Kerberos, Vault-ssh-helper, Archmage
- Backups/DR from a defense perspective
- Multiple accounts
- DR lambda functions and KMS automation
- Approval based deployment: MFA pre-receive Git hooks and GPG signed commits
- Centralized telemetry and Logging
Infrastructure Team Lead, Palantir Technologies
Backend software developer
So we're here to talk through "Hashistack" security. But first, if you haven't heard of us, we're Palantir. We work side by side with a lot of the major organizations across the world, helping them to get the most out of the data that they already have.
Our idea is trying to augment human intelligence with software, not trying to replace the human aspect between the two. So our team is responsible for making developers across the company move faster. I'm a leader on the team and Vlad is one of our tech leads.
So I'm a former software engineer, I turned to systems over time, and I at least hope some of you in this room are interested in the whole infrastructure-as-code fad, which is kind of fun, and we're attempting to automate all the things. So our team is responsible for running everything from the build system to security tools, taking this DevOps mindset. One of the biggest issues we face is that we run a thousand applications. This is actually a correct ratio of about the number of people on the team versus the number of services we have to support, and we're talking everything: Windows, Linux, OS X. From one-off scripts to web apps that we write, we have to do the whole gamut in pretty much every language under the sun. At least it feels that way.
So breaking this down into its components, there's two perspectives I want to think about here. One is the developer's perspective, and one is our perspective.
So developers of course want easy integration to core services like DNS and secrets and monitoring and stuff. They want self-service, because who wants to spend a lot of effort? And on our side we want standardization with just the right amount of customization, while always working on improving our impact and reducing maintenance. So you have to balance this out across the security controls that you need to meet for your organization, and a container scheduler ended up being a really good choice for a large set of our workloads. So we ended up going with the Hashistack. Last year was our first year at HashiConf. We came out here, we went to Kelsey Hightower's talk (phenomenal speaker, absolutely loved it) and immediately went that weekend and started deploying a new Hashistack.
So today we've deployed over 100 different services and small jobs on top of it. We've seen some significant improvements for the company. We've reduced deployment time from about a week to a day, we spend significantly less time on logging and maintenance and patching and things like that, and as a result we have been able to support a lot of other small services that we couldn't afford to support normally. At the same time as doing this, we decided to do an overhaul of the way that we run infrastructure, so it was an interesting chance to try out some new tools and new approaches while still maintaining our compliance.
And that kind of leads me to the next part of the talk, which is that there are two aspects of securing your infrastructure I want to talk to you about today. First of all, on the host level, you've got your encryption, access segmentation, patching, centralized logging, MFA, defensive backups. And then you've got the less talked about one, which is the configuration.
So while lots of talks today are going to be about policies and Nomad policies, there's the question of: if I pop your laptop and I have your SSH key and you're an admin on your own repository, how much badness can I do to your environment?
So, to frame this whole talk our director has a great tagline that I love to use, which is...
» Make the secure thing to do the easy thing to do
Well, we can put our own spin on this because we're DevOps, and we're automation people who are super lazy, right? So how can we make the easiest thing to do, which is absolutely nothing, the secure thing to do?
And by this I mean: where can we use automation to enforce the controls, as was done in the keynote, right, policy as code or whatever the tagline was there. So we're going to go over several aspects of security, specifically how they apply to Hashistack, and how we've reused Hashistack for several components of rolling out the new security tools. But before that let's take a quick step back.
If you're not familiar with Hashistack, definitely watch the Kelsey Hightower demo, it's great. Let's take a quick review of the architecture. So this is normally what you would see on an individual slide: you just see Vault, Nomad, and Consul, just these components. That's super ... Like this is easy, right? All you've got to do ... Like that was the pitch line. You just install these three things and you have an enterprise-grade, you know, Hashistack.
(Just a caveat real quick, I am going to use an AWS specific implementation to make it much more real, not fudge things.)
Okay, so right off the bat, a pretty common one: we actually combine our Nomad and Consul servers together. One less AMI to deploy, nice and simple. Oh, yeah, Vault's in a cluster, Nomad's in a cluster, yep, all fine. Let's take a quick look at the workers. You use, you know, Fabio (there are some new ones out there nowadays) doing load balancing, shifting the traffic over to the apps, all normal. Looks something like this, right? Well, actually, wait a second. There are multiple types of workers, so you've got your Windows workers, you might have your Linux workers, maybe you have a whole separate worker cluster in a separate environment with separate security groups and a separate load balancer for high-trust and low-trust environments. That's a Hashistack though. Except for the persistence. So we have RDS, you might have S3, various back-end parts. You need to get your logging done, you need to get bastion host access.
Okay, so let's just call that one Hashistack. All of the HashiCorp components, all the persistent components, all these other things, and you call that one thing? And probably firewalls between you and your users, probably multi-region. Probably multiple accounts. Probably a backup account. The scope of what you have to do to get an entire thing deployed in a brand new environment is far beyond just the initial components, and there's the giant plethora of tools that you all use and work with, like PagerDuty, and getting Nagios set up, and getting maybe your Datadog components all put together. You know, everyone raised their hand at the keynote today, so all of you have set up something somewhat similar to this in your own way.
Let's go through securing this. Six aspects, but right off the bat, I dropped the first two. Encryption is like the "just do it" of security. You know, at rest, in flight, encrypt your AMIs, just don't use weak ciphers. Just do it. And access segmentation could be this huge long-winded talk about security groups and firewalling and when you use iptables or an IPsec mesh or what you're doing around it, and it's not that interesting. These other four things, as much as you can call patching interesting, actually are a lot more fun to deep dive into.
So, just to review something that should be obvious for everyone ... Why is patching important? Well, the obvious one is that if there's a known exploit out there, that's going to be so much easier to use than someone discovering a new exploit, right? The thing I found very interesting though was, as part of doing research for this presentation, we tried to look up the metrics for how long it takes for a new zero day to go from being created to being exploited in the wild, and found something from Malwarebytes saying that the average has dropped from months down to four days. So in a world where you might be doing a quarterly or monthly patching cycle, even thinking you're good with a weekly patching cycle, you need to react in a matter of days now in order to actually stop a known exploit.
So what does a good patching pipeline need to have? It needs to be in phases: staging and prod. You need to get the whole environment uniform, so you don't find that one box sitting in a corner that wasn't patched for two years because you had a bug in your Puppet code. You need alerting for when it inevitably fails, because everything always fails, and then the ability to roll back from that inevitable failure. And then you need to be able to roll this in a hands-free, touchless environment.
So we ended up going with immutable AMI pipelines over Puppet, especially for the ability to roll back. Most of you should be familiar with this kind of pipeline. We've got Packer to bake the image, Terraform to roll things out, a way to bounce the nodes, and then of course you've got Terraform Enterprise itself. So quick look: we make a single AMI where we put Consul, Vault, Nomad, all into the same image. This is nice because it's one AMI to bake, and then you can use custom scripts to turn that node into an individual Consul node or Nomad node or Vault node. And then we share that image out to all the various sub-accounts. As you can see here from the Packer code, what's kind of nice is that this actually makes version upgrades for the HashiCorp components super easy: you just drop in the new variable and it just rolls out.
So we have Packer baking the image. Now we need Terraform to deploy it, all things that you already know. So probably three autoscaling groups, as we said earlier, probably a lot more than that, but at least the set for Nomad and Consul, Vault, and the workers. Because this is an AWS-specific implementation, you've got the launch configuration that then determines whether it's a Nomad worker, which cluster it goes into, and stuff like that. And then the standard Terraform data source for finding the AMI.
The fun part about this one though is actually rolling the nodes. So if you aren't familiar with this, updating a launch configuration does not update the actual nodes in your account, it just changes the template. So how do you actually bounce a Hashistack? Well, this is the pattern that we follow; you might do something slightly different. So rolling Consul and Nomad: delete a node, add the new node in, delete a node, add the new node in. Vault, same idea, right? Delete the primary, except we've got to execute the step-down.
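The rolling pattern above can be sketched as a small planning function. This is only an illustrative model, not the speakers' tool: the function name, the action names, and the choice to replace the leader last are all assumptions made for the example.

```python
def rolling_replace_plan(nodes, leader=None):
    """Return ordered (action, node) steps that replace every node one at a time.

    Assumption: the leader (for example the active Vault server) is replaced
    last, and is asked to step down before it is terminated.
    """
    followers = [n for n in nodes if n != leader]
    ordered = followers + ([leader] if leader in nodes else [])
    plan = []
    for node in ordered:
        if node == leader:
            plan.append(("step_down", node))  # pre-action before deleting the primary
        plan.append(("terminate", node))
        # The autoscaling group launches a replacement from the updated template.
        plan.append(("wait_for_replacement", node))
    return plan


vault_plan = rolling_replace_plan(["vault-a", "vault-b", "vault-c"], leader="vault-b")
consul_plan = rolling_replace_plan(["consul-1", "consul-2", "consul-3"])
```

Waiting for each replacement before touching the next node is what keeps the cluster at quorum during the roll.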
So we've got a few requirements: we need to support pre- and post-actions. And then you've got the Nomad workers themselves. Those work great red/green style, right? You expand, you add an extra twenty or thirty Nomad workers into your queue, and then you slowly drain your other ones out, maybe do some canary stuff. So to solve this, we really wanted to stick with all the benefits you can get out of Terraform without doing an out-of-band service. So we built a simple Go binary, where the idea is you just inject it in as a null resource. It triggers off the launch configuration change, targeting that autoscaling group. Now the idea behind this is it takes advantage of Terraform concurrency. So if you have unrelated stacks like your Logstash cluster, your bastion cluster, and your Vault servers, those can all bounce in parallel because they're unrelated to each other. However, the Hashistack probably should be bounced in a specific order, so what you can do is just have the workers depend on Vault, Vault depend on Nomad/Consul, and then you can cycle things in an exact order.
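To make the ordering idea concrete, here is a minimal sketch that turns a dependency map into groups that could bounce in parallel. It uses Python's standard `graphlib` rather than Terraform, and the group names are hypothetical; grouping into lockstep "waves" is a simplification of how a dependency graph is actually walked.

```python
from graphlib import TopologicalSorter


def bounce_waves(depends_on):
    """Group clusters into waves; everything in one wave can bounce in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all clusters whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Each key bounces only after the clusters it depends on (hypothetical names).
dependencies = {
    "workers": {"vault"},
    "vault": {"nomad_consul"},
    "nomad_consul": set(),
    "logstash": set(),
    "bastion": set(),
}
waves = bounce_waves(dependencies)
```

Here the Logstash and bastion clusters land in the first wave alongside the Nomad/Consul servers, while Vault and the workers follow in order.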
Now the nice part about this is that if a Terraform resource fails, right, it taints the resource. When it taints the resource, the next time it goes around again it's going to attempt to recreate it, so as long as the logic in your code can actually re-bounce the node, you can start to make it so that Terraform is an indicator of whether or not your infrastructure template is not just laid out, but has been fully cycled. There we go. So tossing in the last part, Terraform Enterprise. So, we used Terraform Enterprise ... Oh did I ... Oh I did skip a slide, didn't I? Oh, all good ... So with Terraform Enterprise, you can have basically:
- Phase one, bake the AMI.
- Phase two, kick off the staging part of the build.
- Phase three, kick off your production set.
And when you can chain those all together (which is a feature coming in with Terraform Enterprise 2.0, actually not available in the original) you can end up getting an entirely touch-free environment, as long as the Packer build just kicks off once a day. Any of these results can trigger PagerDuty or something like that, so you get a fully automated, hands-free, touchless pipeline.
So tying back into our framing device here, right? How can we make doing absolutely nothing the secure thing to do? Well, by having a touch-free environment that patches itself every night and cycles all the workers every night, you not only get the advantage of avoiding drift, but you also get one of the shortest windows you can for patching a known exploit.
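As a rough illustration of chaining phases with alerting, the sketch below runs phases in order and pages on the first failure. The phase names and the `page` callback are placeholders for this example; it does not model Terraform Enterprise itself.

```python
def run_phases(phases, page):
    """Run phases in order; stop and page on the first failure."""
    completed = []
    for name, step in phases:
        if not step():
            page(f"{name} failed")  # e.g. forward to an on-call system
            return completed
        completed.append(name)
    return completed


alerts = []
completed = run_phases(
    [
        ("bake_ami", lambda: True),
        ("staging", lambda: False),  # simulate a staging failure
        ("production", lambda: True),
    ],
    alerts.append,
)
```

Because production only runs after staging succeeds, a bad image stops at staging and someone gets paged instead.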
So swapping over to Vlad.
Vlad: So let's talk about centralized logging now. Okay, why is logging important? So, imagine it's late at night, you have an outage, you want to go and debug.
What happened over there? You want to find the logs. Well, you don't have them, so that's the first problem. And also imagine your incident response team does an investigation and they don't have all the logs from the environment. It's really hard for them to reconstruct the actual trail of events of what happened over there. Besides that, you have a lot of third-party apps that you deploy: you need to grab logs from Duo, you need to grab logs from CloudTrail, and then you have Vault logs, and then you have your Nomad app logs. You need to centralize all these in one place. Let's establish what good looks like for a centralized logging pipeline.
You want your logs to drop into your SIEM in five minutes or less. You want to be able to react fast to events. You need to support multiple log formats, so you'll have JSON logs, you'll have httpd logs, NGINX, CloudTrail logs, all of them. You need long retention policies for legal reasons, so one, two, three, maybe four years to keep the logs. You need to clearly identify where each log line came from, so you need to know, hey, this log line is from that instance. And also everything should be opt-out rather than opt-in. All the logs should be available by default.
So when we started, we had this picture over here. We had the Nomad worker that piped all the logs to the SIEM, and it's kind of fine at this point. And then a new team came to us. They started deploying some apps on the stack and they're like, "Hey, we want these logs to go into a separate place, can you do that for us?" We're like, okay, we're going to install a different forwarder on the worker. So at this point the worker ships logs to different places. This doesn't scale: imagine having hundreds of workers doing this, it's just not good. So we introduced Logstash into the game, and now Logstash does all the log bifurcation. The host just ships everything to Logstash, and everything flows from there. Also the Logstash config can do some ACL'ing, so Logstash knows where each log line came from, and it can drop it into different indexes in your SIEM. Let's look at how everything's implemented over there. All our hosts come with rsyslog by default, so you have your kernel logs, sshd, and cron that go to rsyslog. Also, when the machine boots, we drop onto the machine a custom rsyslog template that rewrites all the logs into JSON and also attaches the instance ID and the Amazon tags to each log line, and then everything goes into Logstash.
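The effect of that template can be pictured with a small sketch: each raw log line becomes a JSON record carrying the instance ID and tags. This is plain Python for illustration, not rsyslog template syntax, and the field names are assumptions.

```python
import json


def to_json_record(message, program, instance_id, tags):
    """Wrap a raw log line in JSON and attach where it came from."""
    record = {
        "message": message,
        "program": program,          # e.g. kernel, sshd, cron
        "instance_id": instance_id,  # hypothetical field name
        "tags": tags,                # cloud tags copied onto every line
    }
    return json.dumps(record, sort_keys=True)


line = to_json_record(
    "Accepted publickey for admin",
    "sshd",
    "i-0123456789abcdef0",
    {"role": "nomad-worker", "env": "staging"},
)
```

With the origin stamped on every line at the host, the downstream router never has to guess where a record came from.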
And then you have the logs from Nomad, Consul, Vault, and Docker. Those go to journald, and we configure journald to send everything to rsyslog as well. And then on to the Nomad app logs, the actual applications that run on the Hashistack. We use Filebeat to grab all the logs, but we have a cron that runs every minute and rewrites the Filebeat config to crawl paths in the Nomad allocation directory. And over there we attach different fields to each log line: the alloc ID, job name and task name. So let's see how Logstash ships to different indexes now. Logstash receives the log line, finds different fields in the log line, and based on those decides which index it's going to ship the logs to. So imagine having syslog: you want those to go in the host log index. And then you have your Vault audit logs; those go into a different index for the infosec team. And then you have the actual application logs; those go in a different index for developers.
So in the case when you have an outage and you want to investigate what happened with the machines in an autoscaling group, you can just search using the Amazon tags that you have defined on them. Or, as a developer, you know your service name or allocation ID, and you can just search everything in the Nomad app logs. And as a developer, in the Nomad job file you just add a tag: log index equals datateam. Logstash is going to read this tag and ship all the logs to that specific index.
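The routing decision described above can be sketched as a single function over a parsed log event. This is Python for illustration rather than Logstash configuration, and the index names, the `program` value for Vault audit logs, and the `log_index` field name are assumptions.

```python
def pick_index(event):
    """Choose a SIEM index from fields attached to the log event."""
    if event.get("log_index"):
        return event["log_index"]      # explicit tag from the job file wins
    if event.get("program") == "vault-audit":
        return "infosec"               # audit logs go to the infosec team
    if "alloc_id" in event:
        return "nomad-apps"            # application logs for developers
    return "host-logs"                 # kernel, sshd, cron and similar


host_index = pick_index({"program": "sshd"})
tagged_index = pick_index({"alloc_id": "abc123", "job": "etl", "log_index": "datateam"})
```

Checking the explicit tag first is what lets a team redirect its own logs without any change on the workers.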
So let's shift gears a bit and talk about telemetry. We won't name any tools, but you should ask your infosec team which telemetry agent they prefer. Using a telemetry agent, you can see the running processes on the box, and you can investigate which kernel versions are deployed. You can see binary versions, or you can actually look at network connections: which process is making network connections to other hosts. Also you can monitor high-risk files. Imagine monitoring authorized keys for different users on the box, or /etc/hosts, or /etc/passwd. Or actually binary integrity ... Maybe somebody popped the box and modified the critical binaries over there.
Now back to our security principle: make doing absolutely nothing the secure thing to do. All our hosts come with this logging pipeline defined by default and all the Nomad app logs are collected by default, so you as a developer just focus on writing your Nomad job file and deploying it, and all the logs are there. Let's talk now about MFA. So why is this important? Just read the news, see the attacks; a lot of the attacks could have been prevented just by having two-factor auth on the accounts.
You have a lot of different apps deployed, so you have your custom apps, and then you need to MFA your cloud provider. You need to MFA Vault, you need to MFA GitHub, Terraform Enterprise, CircleCI, and then at the end you need to MFA SSH as well. So let's assume for a moment that all these web apps just use SAML and you put MFA on your SAML provider. You still need to MFA SSH. So, this actually happened to me a while ago: I got a page that one of the Consul nodes stopped working, and I was like, OK, let's SSH in and see what's happening over there. So first you need the IP of the host you want to SSH to. You open up your Amazon console, you type in your user and password, press enter. You unlock your phone to get the MFA secret, you open the Google Authenticator app or the Duo app, you type in the six digits over there, and okay, I'm in the account. I go to the region, I click EC2, I click instances, I filter down by the name or the instance ID, and OK, now I have the IP address of my instance. That's fine.
So as Elliot said, we have multiple Hashistacks in multiple regions, and each of those has a specific bastion. Now I need to find the bastion I need to SSH to. So I look up the bastion DNS name in the docs; maybe somebody actually saved all the DNS names over there. Okay, I found it. And I need to look up the SSH ProxyCommand, which, I mean, who remembers it? Okay, I found it. I typed it out, that's fine. I unlock my phone because at this point the bastion issued an MFA prompt to me. I respond to the push, I say OK, yes. And I got rejected. There are so many steps that you don't even know what went wrong, and with all these actions it's non-trivial to debug this.
So we have a couple of goals here. We want to simplify finding hosts: finding the target host you want to SSH to, and the bastion you're going through. We want to simplify SSH (ProxyCommand is really annoying), and we want to reduce the time to outcome, because while I'm doing all that workflow I still have an outage. That's actually bad. So we're just using basic tools. We're using SSH, the Amazon CLI, and Vault commands over there, and we can just wrap this using Go to make a tool. We're also using the Vault SSH helper to verify the OTPs. And then we have the Duo PAM module that issues a Duo prompt to you, and in the end, to make everything nice, we're using YubiKeys. If you're not familiar with YubiKeys, they're USB devices: you just tap the YubiKey and it drops a Duo code over there.
So conceptually this is how it works. You're on your laptop, you have a shell over there, and you use your AD creds to auth with Vault. Vault calls your MFA service, and that goes through. You get your Vault token on your machine. You use that Vault token to get Amazon STS creds. Using those creds you describe all instances in the account and you find the target host and the bastion. You use the Vault token to get OTPs from Vault. You use the first OTP and go to the bastion. The bastion calls out to Vault using the Vault SSH helper to validate the OTP. That verifies that you have access to SSH in. That's fine, so it goes to Duo, which issues a Duo prompt to you. You respond to the Duo prompt. Using the second OTP, you go to the target host, the target host does the Vault SSH dance, and now I have a shell. Now I can actually start debugging my outage. That's fine. So the tool that we built is called Archmage. Let's imagine we have two Hashistacks: we have -Prod and -Staging. I run archmage auth, and now I need to select the environment I want to auth with.
Let's say I select the first one. I need to enter my AD password. I enter my AD password, I get prompted for a Duo code, I just press my YubiKey, and I'm in. Now in my console I have a Vault token and I have the Amazon STS creds. Now I want to SSH to a Hashistack worker. I do archmage ssh and then I can pass a filter: I want to filter all the instances in the account by "worker". So I choose the first one. And now I tap my YubiKey and I'm in the instance. So we're using Vault for one-time passwords. We have readable access policies over here, so you know which user can go to which host, and we also get read-only Amazon STS creds for free from Vault. We have Duo and YubiKeys to make everything easy, and we wrapped everything in Go to make it smoother.
So back to our security principle, making the secure thing the easy thing to do: by automating this basic workflow and minimizing the steps required to SSH with MFA, you get an easy path to debug something in production.
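The host-discovery part of that workflow (filter the described instances, then pair the target with a bastion) can be sketched as below. This is not Archmage's code: the instance records, the `role` and `env` fields, and both function names are hypothetical, and the Vault, STS, and Duo steps are deliberately left out.

```python
def find_hosts(instances, name_filter):
    """Return instances whose name contains the filter string."""
    return [i for i in instances if name_filter in i["name"]]


def ssh_hops(instances, target):
    """Return the [bastion, target] addresses to hop through for a target host."""
    bastions = [
        i for i in instances
        if i["role"] == "bastion" and i["env"] == target["env"]
    ]
    if not bastions:
        raise LookupError(f"no bastion found for environment {target['env']}")
    return [bastions[0]["ip"], target["ip"]]


# Stand-in for the result of describing all instances in the account.
instances = [
    {"name": "staging-bastion", "role": "bastion", "env": "staging", "ip": "10.0.0.5"},
    {"name": "staging-worker-1", "role": "worker", "env": "staging", "ip": "10.0.1.11"},
    {"name": "staging-worker-2", "role": "worker", "env": "staging", "ip": "10.0.1.12"},
]
workers = find_hosts(instances, "worker")
hops = ssh_hops(instances, workers[0])
```

Deriving the bastion from the target's environment removes the "which bastion do I use" lookup from the manual workflow.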
Elliot: So something I'm super passionate about is defensive backups, because I don't feel like most people think about backups from a security perspective.
So, backups are kind of like logging. You can't retroactively say, "Oh man, I really should have backed up that thing," right? It's one of those where 100% you can't screw this one up. The really scary one is, and if you're in this scenario don't raise your hand, how many of you are in a scenario where you're the admin of both the backups and the data in the production account, so with your account you could delete everything? This has happened to real companies that no longer exist; we don't know them because they're gone. Don't raise your hands, I'll see. (laughs)
So there's a real question of can you afford that risk? This is one of those heartburn things that I think about a lot.
So for backups, you're trying to balance this thing. On one side you've got just the dollar cost of attempting to defend yourself against this, plus your recovery time objectives and your recovery point objectives. And balancing on the other side: are you trying to defend against a hurricane, or individual host failures, or a compromised admin, or maybe (and some companies can afford to do this) what if Amazon wrote a bug that deleted all data out of all S3 buckets? That would be absolutely traumatizing, right? So it's one of those paranoid, you know, heartburn moments. Am I certain that I've covered everything?
So OK, again, what does good look like? Well, for us, again this being an AWS-specific example (you would need to use something similar for Azure or Google), it's probably a separate account. For us, it means that no member of my team is an admin on the secondary account. Everything should be backed up by default. It shouldn't be an option where you have to say, "Oh, please back this up." Nope, we're going to back it up, spend the money, and not take the risk. At the same time, our storage can be spread across multiple accounts: with account segmentation for various production environments, all of them need to be backed up. And the backups should be available for testing, both the regular ones and the disaster-recovery-style backups. So this is actually really easy to implement, and we've tossed it on top of the Hashistack.
So let's start with RDS as an example. You have RDS in the account here, and we just want a simple Hashistack job that uses Vault and the STS credentials to create a DB snapshot. Great: wake up every hour, create your new snapshot. We then share read access for that snapshot to a secondary account. From the secondary account we can run a Lambda function that just says: if you find anything shared with you, make a copy of it into this account, and then grant the read access back. So the end state here is that the primary account has read access to both sets of backups, but it only has write access to one of the sets. You need to do encryption, of course. Same idea with dynamically created KMS keys: you can separate the key access, because again one of the loopholes is that if I delete the key that's used for both backups, I've also got you.
This also works with separate accounts. Again, with Vault you can just generate STS credentials and assume role into multiple accounts one after the other, creating these backups, and the single Lambda function on the other side can just see that. It's fine because the backups are only shared back with the corresponding source account that they came from. So we ended up just using a standard Nomad batch job, another thing we ended up tossing on the Hashistack. And the really critical crux of this is, at the end of the day, you take the only service account, or set of service accounts, that has access to this backup account, print the credentials off, lock them in a fireproof safe, give the keys away to any team other than yours, and you end up with this clear split.
So for making doing absolutely nothing the secure thing to do: by assuming everything needs to have backups, taking them defensively, and then sharing them and making them available to each account owner to test their own backups, we're able to make certain that doing nothing is the right thing.
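The copy-and-share-back logic on the backup side can be modeled in a few lines. This is an in-memory sketch of the idea only: it makes no AWS calls, and the record fields and function name are assumptions for illustration.

```python
def copy_shared_snapshots(shared_snapshots, backup_store):
    """Copy every snapshot shared with the backup account that is not yet copied.

    Each copy is owned by the backup account and shared read-only with the
    source account it came from, and with no other account.
    """
    copied = []
    for snap in shared_snapshots:
        if snap["id"] in backup_store:
            continue  # already copied on an earlier run
        backup_store[snap["id"]] = {
            "source_account": snap["source_account"],
            "read_access": {snap["source_account"]},
        }
        copied.append(snap["id"])
    return copied


backup_store = {}
shared = [
    {"id": "snap-001", "source_account": "prod-a"},
    {"id": "snap-002", "source_account": "prod-b"},
]
first_run = copy_shared_snapshots(shared, backup_store)
second_run = copy_shared_snapshots(shared, backup_store)
```

The source account can read its copies for restore testing, but nothing in this flow gives it a way to delete them.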
Vlad: OK, so let's see how we actually secure our Nomad job files and Vault policies. In our experience, we found two ways of managing a Hashistack. We have the centralized way: keep all the things in a git repo, they go through a CI process and then they get shipped to production. The other one is the decentralized approach: every user has their own git repo and their own CI process, and can deploy independently. Now imagine somebody gets popped and you have a bad actor in your network; what happens in that case? In the centralized approach, they try to put something into GitHub, and maybe that gets past, goes into the CI, and then gets deployed to prod. In the decentralized approach you have the same pipeline, but you have so many users and so many instances of it that securing it just doesn't scale. Also, with the decentralized approach you have the problem of who owns which container. You have an app on the Hashistack which has an outage, and you need to track down the owner. Maybe you can't even find them. Or somebody deploys a container on the stack that eats all the RAM, or maybe it's a container that was built four years ago in the Stone Age, and you still need to track down the owner. We chose the first one: we chose to centrally manage our files.
So you have the user, you push to GitHub, it goes to the CI, it goes to prod. Here you have two places where you can stop bad things from happening. You have the first gate, which is before entering GitHub; you can stop bad actors there. And then also in the CI, you need to stop bad code from getting into production. Okay, let's look at the first gate. To stop bad code getting into the repo you can use GPG to sign your commits. So you have your master branch and you fork, you do your dev over there on your branch, you sign your commits with the YubiKey, and then you try to merge it. When you try to merge it, GitHub is going to execute the pre-receive hook, which is going to try to validate your signature. If that passes, your code goes into the master branch.
However, we chose not to do this. We made a different service called Duo Bot. In our case you do your dev over there, you push it, and GitHub executes the pre-receive hook, which calls out to Duo Bot. Duo Bot calls the Duo API, which sends you a Duo push. If you acknowledge the push, GitHub is going to merge this.
So let's see the difference between these two approaches.
- For Duo Bot, it's easy to deploy, and you can add it to many repos.
- For GPG-signed commits, it's easier to revalidate the attestation.
- As for the cons, Duo Bot is actually difficult to use on an airplane: Duo can't really reach you if the Internet's bad.
- And you have a weaker audit trail. You know who the user was, but only at the point in time when the commit gets merged. Afterwards you lose the audit trail. You need to go into the logs and find what happened.
- And a con for GPG-signed commits is that the permission scheme is actually more difficult. If you have lots of members on the team, managing who can do what doesn't work well.
So to make everything easier we actually built a couple of GitHub bots. We built the first one which is an approver—a file path-based approval. So imagine changing something and then staging Hashistack in a file over there. GitHub is gonna call for approval going to approve by default on staging but on production, is going to post a status check and is gonna wait for an admin to do a plus one on on the PR to actually turn the status check into green.
And then we have Duo Bot. So Duo Bot keeps state: the key, which in our case is the SHA of the merge commit, and the value for that key, which is the status from Duo, whether the push was acknowledged or not. We also have Bulldozer, which is an automerge bot, so it automerges your PR in GitHub when all the status checks are green.
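As a rough sketch of the two ideas in this paragraph: a store keyed by merge-commit SHA that records whether the push was acknowledged, and a merge decision that requires every status check to be green. The class and function names are hypothetical, and the in-memory dictionary is only an assumption about how such state could be kept.

```python
from typing import Dict


class PushState:
    """Maps a merge-commit SHA to whether the MFA push was acknowledged."""

    def __init__(self) -> None:
        self._by_sha: Dict[str, bool] = {}

    def record(self, merge_sha: str, acknowledged: bool) -> None:
        self._by_sha[merge_sha] = acknowledged

    def status(self, merge_sha: str) -> str:
        # Unknown SHAs are pending: no push has been answered yet.
        if merge_sha not in self._by_sha:
            return "pending"
        return "success" if self._by_sha[merge_sha] else "failure"


def ready_to_merge(checks: Dict[str, str]) -> bool:
    """An automerge bot merges only when every status check is green."""
    return bool(checks) and all(s == "success" for s in checks.values())


state = PushState()
state.record("abc123", True)
checks = {"mfa": state.status("abc123"), "ci": "success", "approval": "success"}
```

Keying on the merge commit means a new commit on the PR has no recorded acknowledgement and would need a fresh push.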
We open-sourced Duo Bot and Bulldozer, and we're working on open-sourcing Approval Bot.
So now going back to the second gate, the CI. What we can do over there: we write a couple of unit tests for our job files and Vault policies. We enforce naming schemes, folder structure, and health checks. We disable the use of certain drivers, so if you try to push something that uses the exec driver, the CI won't let you merge that. And also we do not allow Docker host networking and privileged containers.
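A CI check along these lines could be sketched as below. It assumes the job has already been parsed into Nomad's JSON job shape (`TaskGroups`, `Tasks`, `Driver`, `Config`), that the banned driver is Nomad's `exec` driver, and that the Docker settings are expressed with the `network_mode` and `privileged` options. `lint_job` and `FORBIDDEN_DRIVERS` are hypothetical names, not the team's actual tests.

```python
# Assumption: the banned driver is Nomad's exec driver.
FORBIDDEN_DRIVERS = {"exec"}


def lint_job(job: dict) -> list:
    """Return a list of policy violations found in a parsed job spec."""
    violations = []
    for group in job.get("TaskGroups") or []:
        for task in group.get("Tasks") or []:
            name = task.get("Name", "<unnamed>")
            driver = task.get("Driver")
            config = task.get("Config") or {}
            if driver in FORBIDDEN_DRIVERS:
                violations.append(f"{name}: driver '{driver}' is not allowed")
            if driver == "docker":
                if config.get("network_mode") == "host":
                    violations.append(f"{name}: host networking is not allowed")
                if config.get("privileged"):
                    violations.append(f"{name}: privileged containers are not allowed")
    return violations


bad_job = {
    "TaskGroups": [{
        "Tasks": [
            {"Name": "shell", "Driver": "exec", "Config": {"command": "/bin/sh"}},
            {"Name": "net", "Driver": "docker",
             "Config": {"image": "nginx", "network_mode": "host", "privileged": True}},
        ]
    }]
}
problems = lint_job(bad_job)
```

A unit test in CI would then fail the build whenever `lint_job` returns a non-empty list.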
Another cool thing that we can do with the CI is try to converge to what is checked into the Git repo. So at this point in time you know the jobs that you have in the Git repo and the jobs that you have on the stack. So you add something new in the Git repo, it gets deployed to production. You remove something from the Git repo, it gets removed from production. And it applies all the modifications in between.
This CI pipeline runs as a housecleaning job every hour. This is an actual example: here we have a Nomad folder which has the jobs defined in it. Imagine having a bad actor on your network that ninja-deploys something to production which is not in the Git repo. The housecleaning job is going to kick in, do the diff between what's here and what's actually deployed, and remove the bad job from production. So ...
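The convergence step amounts to a diff between two sets of jobs. A minimal sketch, assuming both sides can be represented as a mapping from job name to job spec (the function name and return shape are hypothetical):

```python
def plan_convergence(repo_jobs: dict, deployed_jobs: dict) -> dict:
    """Compute what must change so the stack matches the Git repo.

    Both arguments map a job name to its spec. The repo is the source of
    truth, so anything deployed but absent from the repo is removed.
    """
    return {
        "deploy": sorted(n for n in repo_jobs if n not in deployed_jobs),
        "update": sorted(n for n in repo_jobs
                         if n in deployed_jobs and repo_jobs[n] != deployed_jobs[n]),
        "remove": sorted(n for n in deployed_jobs if n not in repo_jobs),
    }


repo = {"api": {"count": 3}, "web": {"count": 2}, "worker": {"count": 1}}
deployed = {"api": {"count": 3}, "web": {"count": 1}, "rogue": {"count": 1}}
plan = plan_convergence(repo, deployed)
```

In this example the job named `rogue`, which was never committed, ends up in the `remove` list, which is the housecleaning behavior described above.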
Eliot: I think that was my cue.
Eliot: All right so doing the summary of everything we talked about:
- We enabled daily automated patching in order to minimize risk of exposure to an attack. I now remember the slide that I skipped through by accident, which was the part where we've open-sourced the bouncer code for doing this. Also available on github.com.
- We did MFA, so MFA all the things: MFA your repository with all your Vault stuff in it, and MFA your SSH by reusing Vault, without having to deploy more infrastructure.
- We did centralized logging, with the ability to handle ACLs on the centralized logs.
- Defensively storing backups across multiple accounts so that no rogue admin or rogue credential can take out your whole company.
- And then at the same time protecting the Nomad and Vault files.
So thank you, and we do have some time for questions I think, eight minutes?
Crowd: Oh it is now, oh. So if you're ... Whoa, that's loud ... With CI/CD applying the Vault policies, how do you set up the trust to make sure something doesn't end up going through the whole CI/CD process with a bad Vault policy, other than just human review, perhaps?
Eliot: So it's a combination. You've got the code review, the enforced mandatory code review, on one side. And by bad I assume you mean a malicious actor is trying to submit that commit. So for one, it requires code review from other people on the team, and then the path-based approval stuff allows us to highlight specific team members or security teams depending on what you're trying to change, because we can analyze it. And then the MFA on that side proves that it was the right user who did it. So it still has to be a person who has the correct GitHub merge permissions to then make that change. So A, you get the approval process, but then you also get the ... That person had to have that merge approval. And so those are the things, as well as then, on the separate side, unit testing checking for things that are against compliance, which now we can replace with Sentinel, which sounds a lot better. Does that answer your question?
Crowd: Yeah there's also the trust on machine, now you have a machine or service that has ...
Eliot: Yes. So this was asking how do we establish the trust with the machine that is trying to do the CI process? We actually run that inside of Hashistack, so that actually runs inside the environment. So there's some other magic sauce that has to happen to get the connection to go through.
Eliot: The idea is that if you can protect the code that goes into a repository, by GPG signing your commits or by doing Duo or various components like that, then if you can verify that the content you have is the correctly signed commit, you can trust that that application is safe. And then it's just a matter of securing the environment in which you are executing that part of the CI process.
Moderator: Is that it for questions?
Eliot: That was easy.
Crowd: So my question is about things that you didn't want to touch ... it was about the access management thing. So how do you deal with the problem of system identities and user identities while you are doing the CI/CD pipelines?
Eliot: Which ... which aspect for ...
Crowd: Identity access management.
Eliot: On the host or in the CI system?
Eliot: On the host system you can do this through Vault policies, so Vault has its SSH backend and you can set up the policies for that. On the CI side, that's just about identity access management on the actual tool of choice that you use for CI, whether ... So that ends up all linking back to SAML in the end, or Active Directory.