Case Study

How Autodesk operates its cloud apps as a factory

Are your apps special? Are you operating a sprawl of snowflakes? Are you dependent on the snowflake owners' knowledge? This talk shares how Autodesk avoided these pitfalls, and how they use LDAP, AWS IAM, Vault, Terraform and other tools to help.

The consistency of cloud APIs is an opportunity to turn your cloud into a navigable sea with data-driven management. The key is coming up with organizing principles that truly simplify navigation and management, and then incrementally applying data-driven tools to implement that organization.

At Autodesk we're organizing over 100 customer-facing applications in the cloud, leveraging numerous platform services. This talk will share some of our organizing principles, and how we use LDAP, AWS IAM, HashiCorp Vault, Terraform, and other tools to implement them.


Transcript

I was thinking of really going with the factory theme here a bit, to talk about standardization and how we're using various tools, including HashiCorp tools, to get there, but I ended up going with a different theme: Tetris.

It actually provides kind of a good metaphor in the sense that whether you're talking about public or private clouds, the real issue is the APIs. The standardization, the regularity of the cloud changes the way we do everything on top. We just rely on it more and it simplifies things for us, and it gives us the opportunity to actually standardize a whole lot more on top. Much more than before. I don't think we really realize what it's done for us.

I mean, you look at developers, they move to the cloud very fast. Us infrastructure guys, eh. It took a little while, but why? The diagnosis, to me, whether it's trying to change an individual technology or a practice inside of the company, is conflation. It's not just that you're trying to do the right thing, it's that everybody is using that right thing, that same thing, with different tools, in different ways, in different verticals, so the combinatorics of trying to change this are just astonishing.

So, on with the game. Let's play a little software Tetris. It's not supposed to work that way. Evidently, I have to ... Thank you. This did not click the same way as my regular clicker. But, here we are. We're playing Tetris. We have some really awesomely simple shapes, they come down. They're pretty regular. We know how to use these things. What's magnificent actually, is the simplicity. We're all ready for the shapes. Now, how did we get to these shapes? First of all, we started out with simple blocks and things that were, shall I say, standard. And, like I said, we can understand how they are used and so forth. But as we get further on down the line, we'll find that we rely so much on the standardization that we can't do anything much more complex. So, all of a sudden a new shape arrives. How the heck are we ... Oh, we're getting ... Oh, we got a special thing. We can do some glueware here, okay?

We're going to do some neat things. We're going to keep up with this. That's a really special X, we like it a lot. That's really going to help our infrastructure. Well, in the end, it kind of gets rough, because yeah, we kind of made it through that one, but the next one? Where do I put that thing? Now, I've got another hole. I've got more of the regular stuff happening. The game goes on, but the real thing here is, we had forgotten about what that standardization is doing. All of a sudden, we get these guys. I'm going to do a good job trying to keep up with this guy, but after a while, you see more glueware, glue will fix anything, right? But, here we go. We'll finally get to the end of this, but you'll see in the end, we'll get F after F after F, and we will fail. This is not going to work. What do we do so that we don't end up like this? Game over, 'cause we fall behind. When the complexity gets so big, we don't have the resources to keep up with it and we don't have the fancy stuff. So, it's all about standardization.

So, then I'll try here: I started with the answer. I'm going to work back to this slide again later, but I thought I'd give you an idea of kind of where we're going. On the left, we have really the things that are part of our company: what's standard about how we build out things. Then, all of this stuff on the right is execution. You'll see in this that, again, this is a somewhat simplified picture; we're focusing on really the deployment process and deployment management. I don't get too much into the utility services. Utility services, those are things like the monitoring, the log stashes, and all of those other things that go into what builds out an app. We spent a long time trying to just work through the different layers and standardize this. We spent a bunch of time thinking about how to have an umbrella. We want control. We want to own the umbrella. That umbrella then pushes on ... We have DSLs for whether it's ECS or whatever kind of tool you might have, you can use plain old Docker, we also have just standard patterns that Terraform pushes, and a bunch of cloud-native patterns.

Part of this is, I have hundreds of apps, customer-facing apps, I have to deploy, and I have to find ways so I can cover all of them and bring them along for the migration plan. You see the standard patterns pushed down to a base configuration of the OS built with Packer, with Chef configuring the rest of it. And we use Amazon as our cloud provider. In the middle, I have the black box, which is the mediation. Access mediation is the key to how we use Vault very effectively inside of our company, so I'll talk more about that detail later.
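
As a rough illustration of that build chain, here is a minimal Packer HCL sketch of baking a base OS image that Chef would then configure further. The AMI filter, region, and names are hypothetical, not Autodesk's actual configuration.

```hcl
packer {
  required_plugins {
    amazon = {
      version = ">= 1.0.0"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

# Build a base AMI; application-specific configuration is layered on later
# by Chef and the Terraform-driven standard patterns.
source "amazon-ebs" "base_os" {
  region        = "us-east-1"
  instance_type = "t3.micro"
  ssh_username  = "ec2-user"
  ami_name      = "base-os-${formatdate("YYYYMMDD-hhmm", timestamp())}"

  source_ami_filter {
    owners      = ["amazon"]
    most_recent = true
    filters = {
      name                = "amzn2-ami-hvm-*-x86_64-gp2"
      virtualization-type = "hvm"
    }
  }
}

build {
  sources = ["source.amazon-ebs.base_os"]

  # Placeholder for baseline hardening; Chef takes over after this image is baked.
  provisioner "shell" {
    inline = ["echo 'apply OS baseline here'"]
  }
}
```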

I'll talk about how we got to the picture we have. We all know the usual issues. We are adding to that what we'll call Bring Your Own DevOps. We can all kind of relate to the fact that where we started was a bunch of vertical silos, everybody building their own app. Then we said, "Hey, operations is going to take over everything horizontally. We're going to take care of it." Or did we just build another sideways silo that didn't really work well with the apps? The key here is: how do we make it so that our cloud infrastructure team can support those services? How do we matrix that in so that each of the teams that builds apps, owns apps, runs apps, guarantees the reliability of those apps? And you've got to do that, in our world, in a SOC 2, SSAE 16 Type 2 compliant way.

There are a bunch of challenges around that, of course: just the complexity scale, the limitations of various technologies. You've got all sorts of cloud functionality. Everybody's got keys galore. They're all over the place. People's laptops have all sorts of keys to accounts that could be divulged because you went to the wrong website, etc. It's a hard problem to manage all this complexity, because part of what you usually do in a customer-facing environment is you just limit the number of people. There are so few people that really have true access, and you just make those very careful. I have a hundred teams, never mind hundreds of people. Really, we are looking to build that out.

When I say 'teams' there is an implicit thing here: we're talking about what Terraform or HashiCorp calls 'workspaces,' but really, I've called them 'service stacks' or 'service stack instances.' I have, inside the company, a very formal definition for this, but to give you an idea, this is just a unit where you have tightly coupled components. By tightly coupled, I mean they are version coupled. I test them by version, I know what they are, and things outside of it are, by definition, loosely coupled. They cannot be version locked with that specific entity. My world has got layers and layers of these things. For example, Sensu for monitoring at the bottom layer. I've got a bunch of plumbing platform stuff at the next layer up. Then finally, applications. We have a very systematic way of naming them, and a systematic way of using those names for access controls, for monitoring, for wiring everything together. In fact, that very convention is so enabling, because if you look at Amazon, you will find that you can't even write access controls unless you have standardized naming for all of your apps. Sounds weird, I'll get into it later.
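
To make the naming point concrete, here is a minimal Terraform sketch (the app name, security-domain letter, and instance number are hypothetical, not Autodesk's actual convention): because every resource name follows the same pattern, an access-control policy can be scoped to exactly one stack with a single prefix.

```hcl
locals {
  # Hypothetical convention: <app>-<security-domain>-<instance>
  app             = "billing"
  security_domain = "p" # production security domain
  instance        = "001"
  stack_name      = "${local.app}-${local.security_domain}-${local.instance}"
}

# Standardized names make ARNs predictable, so one policy template covers any stack.
data "aws_iam_policy_document" "own_stack_only" {
  statement {
    sid     = "OwnStackResources"
    actions = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"] # illustrative subset
    resources = [
      "arn:aws:s3:::${local.stack_name}-*",
      "arn:aws:s3:::${local.stack_name}-*/*",
    ]
  }
}
```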

The key is each of these apps, they have neighbors, they have dependencies. They live in an environment but they are loosely coupled, have their own resilience, etc.

Again, like I said, how are we organizing what we do? Service stacks. If you look at that original picture I gave you (in your mind, I'll return to it), you'll see that really it was this implicit thing. How did we make that work for this Bring Your Own DevOps world? You realize that it is organized around access control. Access control is kind of cool in the sense that if you can't get there, you can't use it, therefore you must comply. Sounds onerous, but the fact is, when you work through the access controls, they are very freeing, if you get them right, if you get them simple. When we look back at our own past in our Amazon world, the number of access controls we have is just untenable. We can't count them. Never ... well, we could ... machines can count them. The real issue is we couldn't manage them. They were too complicated. In fact, for audit purposes and security, we really had to systematize all of that.

If I look at the standards for this, every app stack instance has these five run-time roles. Always five. There is an Owner, and an Owner, well, truly is an owner. They can do pretty much anything. They can take on any other role. What's neat about an Owner compared to any of the others: they're the only ones that can actually get to the data. The Deployer can deploy apps. It can redeploy apps, it can create a backup, but it can't look at it. It has all the rights to do all the actions to make things happen. The Executor, that's what each of the things running at run-time does. Those Executors can get to the database secrets. The Deployer can't. We have different layers of roles for different purposes. Obviously an Auditor can see anything, whereas a Viewer can see most things that a normal developer would want to see.
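
A minimal Terraform sketch of stamping out those five roles per stack, with the "a Deployer can write a backup but not read it" split shown as one illustrative policy. Names, account details, and the policies themselves are hypothetical, not the actual Autodesk harness.

```hcl
locals {
  stack_name  = "billing-p-001" # hypothetical stack following the naming convention
  stack_roles = ["owner", "deployer", "executor", "auditor", "viewer"]
}

data "aws_caller_identity" "current" {}

# One IAM role per run-time role, always the same five for every stack.
resource "aws_iam_role" "stack" {
  for_each = toset(local.stack_roles)
  name     = "${local.stack_name}-${each.key}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = data.aws_caller_identity.current.account_id }
    }]
  })
}

# Illustrative split: the Deployer may create backups but may not read them back.
resource "aws_iam_role_policy" "deployer_backup_write_only" {
  name = "backup-write-only"
  role = aws_iam_role.stack["deployer"].id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      { Effect = "Allow", Action = ["s3:PutObject"], Resource = "arn:aws:s3:::${local.stack_name}-backups/*" },
      { Effect = "Deny", Action = ["s3:GetObject"], Resource = "arn:aws:s3:::${local.stack_name}-backups/*" },
    ]
  })
}
```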

The idea is every single stack has one of these pre-baked ... let's call it a harness, which is now a company name ... it's a pre-baked set of roles and security mechanisms that allows us to get all of our work done and to be able to hand off from an Owner. I can then have a Deployer do my deployment work, and it's, by the way, isolated by application. Another major point, though, is that we have a set of utility services. To make this work, I have to be able to run my app in dev/test, staging, and prod, all the same way, repeatedly. If I have something that runs in prod and I secure it in a certain way, I need to be able to do that in dev, during the development process. I don't make them do it all the time, but I make sure they can do it before they do any of the promotions up the chain.
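
The "same harness everywhere" idea could look roughly like this in Terraform: the same (hypothetical) module, wrapping the pre-baked roles, network wiring, and utility-service hookups, is instantiated once per environment so a stack deploys identically in dev/test, staging, and prod.

```hcl
# Hypothetical module encapsulating the pre-baked roles and security mechanisms.
module "harness_staging" {
  source          = "./modules/stack-harness"
  app             = "billing"
  security_domain = "s" # staging
  instance        = "001"
}

module "harness_prod" {
  source          = "./modules/stack-harness"
  app             = "billing"
  security_domain = "p" # production
  instance        = "001"
}
```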

The whole idea of improving speed is, if I give them direct access to what it's like in prod, without the customer data, I can get the whole thing moving a lot faster. By the way, just by fitting into those standards, I give you at least 80% of what you need for security, audit, etc. We lock down versions, we do data collection, all these things in a very standard way. We are still working on some of it; some of it's a little bit aspirational, getting version locking down, for example. The fact is, once you have those streamlined, it actually gets really simple. Because all you do is, you say, "I've got a release." These are all the versions of everything, including the Terraform, including any other components, any other orchestration elements. Including ... and we aren't there yet ... a version of the secrets framework: the list of secrets that are used and how they are initialized. The whole idea is, I want to have that whole thing replicated. When I move it, the data is different, but I have that entire chain of security actions, monitoring actions, all built and then moved.
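
Version locking of a release can be sketched like this in Terraform: pin the Terraform and provider versions, and carry the rest of the release's component versions as a single manifest that gets promoted unchanged between environments. All versions and component names shown here are hypothetical.

```hcl
terraform {
  required_version = "= 1.5.7" # the exact Terraform version this release was tested with

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "= 5.31.0"
    }
  }
}

# A simple release manifest: every moving part is named and pinned, so the same
# set moves from dev/test to staging to prod with only the data differing.
locals {
  release = {
    app_artifact   = "billing-service-3.12.0"
    base_ami       = "base-os-20240115-0230"
    chef_cookbooks = "billing-cookbooks-1.8.2"
  }
}
```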

It does have restrictions, though. I'm not letting anybody create dynamic roles. There are certain canned ones that I can have on command, but the whole idea is that roles and policies are out of the world of the app developer. But I have to give them other mechanisms to deal with it, and I think we do. Again, with the standards and the restrictions, it's a pretty nice place to be.

Let's back up again. If I look at the context of what we were talking about, there's this whole idea of CI versus CD. I say versus, as in separate from. The idea of decoupling these is really important. I personally ... if you ever get into FedRAMP or other places like that, the developer can't know about every deployment. They just can't. You need that kind of decoupling. We have tools like Jenkins which are great at CI. For CI, they pull lots of things together, they go through the testing and so forth, but they lead to that nasty word I said at the beginning: conflation. Because then, all of a sudden, they are conflating all the CI with all the pipeline of CD and the processes and so forth. They assume access to everything and you end up with this one giant thing that bolts everything together in one way. We've clearly had to separate this. Now, one of the things you'll see on the left is 'build.' Build is always on the left. You guys that run Ruby ... we run Ruby ... have fun. It's really hard to get Ruby not to keep sucking in new worlds of gems along the way.

To me, to get predictability, I have to know the version of every single thing that gets pulled in there; that gets locked down. I love containers. At the same time, there are some things, the deployment templates, the configuration templates essentially for automation, that I really want built in dev. I don't want them parameterized at deployment time, so they can go anywhere. The real key here is to have quality metrics: how good is this thing? Over here we have network and access controls. How we shroud them is very different by harness. How we do integration tests is also different. From our point of view, we are still aspirational in terms of getting that across these literally hundreds of applications. How do we make sure that we have fully automated, repeatable integration tests with high code coverage? We don't have that in the harness yet, but getting those things there is what we need eventually to get to that velocity and reliability at the right cost. Because in the end, we can all talk about great things, but we can't do it without keeping the costs down.

I've kind of wandered a little bit. I've gone from a picture of, "Here's a bunch of tools around deployment and the deployment process," backed up, and said, "Well, to do that well enough I have to think in terms of application stacks or service stacks." That's one way to organize it. The other aspect of organizing is making a clear separation of CI and CD so I can enable CD, which is my focus. Then I have to do a little more hygiene in terms of how I standardize and what I do where. I just kind of drew this layer cake, yet another layer cake. We all know what the cloud least common denominator is, or worst common denominator. In other words, what's common across all clouds you might use? To me, that is such a poor starting point. You're not really taking advantage ... you're not taking care of your costs if you are down at this layer. You might be getting the cheapest CPU-second, but that's about it.

When you go up the stack, you really have to use some of the cloud-native capabilities. These clouds ... I mean, between Amazon, Google, and Azure, at least ... they have awesome capabilities and they are in an incredible race. It's wonderful to see. It's helping us immensely. You have to take advantage of these things in, let's say, cloud standard patterns. There are certain things that we know, and we organize around how things work. Maybe it's databases as a service that are part of that, containers as a service that are part of that. There are also cloud-native standards like ELBs and ALBs and AZs. If I go out there and I look at pretty much all products, they are AZ-unaware. They do not help me take care of that aspect of what's going on in my network. Because they don't, I don't buy them.

When I get up to what I call Ninja patterns ... clouds all have lots and lots of features. You've got to stay away from them, because those are the things in the cloud adapters that say, "Hey, you have this nice, sunny place and we'll take care of all the complexity of turning 'your network ideas' down into a layer on top of the cloud-native stuff." First of all, you've got cost, you've got complexity, you've got more security concerns. Those are the things we, again, try to stay away from in our layering. We try to go from my infrastructure to the cloud standard patterns as quickly as possible as a means of getting all the things that we want in our hygiene. I've talked about all of that; I'm talking about alignment now with the clouds, or cloud, that you use. We tend to use Amazon. That alignment, by the way, also has this other part to it, which is what I'll call Conway's Law. How you organize and choose software does depend on how your organization is put together.

Whoever deploys or builds each of these different pieces, you have to have clear connections between ... separations between ... who does what. For example, our change control ... all of this stuff on the left, I'll call that "semi-static." Those services are really static. They create those standard five roles for every app service. I've registered the service. It has rights, or groups, that are created in LDAP. We have a change control process. Now, to actually make this work ... I put pipeline manager as a box there because that's a missing component. Through that, all the expectations are set so that the access mediation can work right. Now, I'll go through exactly how that works shortly. It has the Deployer role for the deployments. All of these things run in the deployment role, and the app, which runs, really, in this sort of frame, runs as the Executor role to access all of its components and so forth.

The standard patterns, you can understand what those are: they're sitting up at an ELB and networking down to a set of web servers that include, shall we say, all the standard componentry for scaling and so forth. We also allow CloudFormation and some other mechanisms. Truly, we also have to have Windows mechanisms, because I can't bring all my development groups onboard at once, so I have to have some amount of wiggle room, but I can still have an umbrella using Terraform that at least gives me some ability to control and interact with other things. You'll notice also I have app secrets in a different place from access mediation. I look at access mediation as the operational secrets; that's where all the secrets for the cloud live. Whereas app secrets might be the database password or the certificate I need to talk down an application-layer pipeline. Those are the kinds of things that are there; they are needed by the application at run-time and have a different life cycle. I'm not sure if we're going to use Vault for that too or use cloud-native.
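
One way that separation could be expressed in Vault policies (paths and names hypothetical): the Executor policy can read the stack's application secrets, such as a database password, while the Deployer policy only gets the operational, short-lived cloud credentials and has no path rule for the app secrets at all.

```hcl
# Run-time (Executor) policy: read application secrets for this stack only.
resource "vault_policy" "executor" {
  name   = "billing-p-001-executor"
  policy = <<-EOT
    path "secret/data/billing-p-001/app/*" {
      capabilities = ["read"]
    }
  EOT
}

# Deployment (Deployer) policy: operational secrets only -- temporary AWS creds.
# There is deliberately no rule for secret/data/billing-p-001/app/*.
resource "vault_policy" "deployer" {
  name   = "billing-p-001-deployer"
  policy = <<-EOT
    path "aws/creds/billing-p-001-deployer" {
      capabilities = ["read"]
    }
  EOT
}
```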

We also, by the way ... when I mentioned standard VPCs: all of our networks are built out the same. They all have the same subnet sizing, the same sort of naming conventions, and so forth. That way, again, none of these apps have anything to do with building out new roles, policies, or networks. That allows us to do the standardization. So while I say it's decoupling, it is decoupling at one level: decoupling in that the app works separately from the provisioning of the network. But the app has to be standardized according to that.
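
A minimal Terraform sketch of the "every VPC is built the same" idea (CIDRs, AZs, and names are hypothetical): one base CIDR carved into identically sized, identically named subnets, so applications never define their own networks.

```hcl
resource "aws_vpc" "standard" {
  cidr_block = "10.42.0.0/16"
  tags       = { Name = "billing-p-001-vpc" }
}

# Three identically sized private subnets, one per AZ, always carved the same way.
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.standard.id
  cidr_block        = cidrsubnet(aws_vpc.standard.cidr_block, 4, count.index) # /20 slices
  availability_zone = element(["us-east-1a", "us-east-1b", "us-east-1c"], count.index)
  tags              = { Name = "billing-p-001-private-${count.index}" }
}
```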

Here I'll just walk through quickly what Vault mediation looks like. You start with [inaudible]. Initially, this is the pre-provisioning, the semi-static stack; go to ServiceNow. The user's put in a group in Active Directory that allows, for example, Deployer access. HashiCorp Vault knows that if I'm looking for this role in this account, I have to go to Active Directory and find out if you're a member of that group. What I've done now is I have all of my management of who has access to what in my Active Directory. I don't have users enrolled in my cloud. I can do this with multiple clouds, and I can clearly show my auditors, through Active Directory, who does what. Now, when my user wants to access the cloud, they log into Hashi- into Vault and say, "I would like this role." Vault checks to see if they've got that role. It does the Duo multi-factor authentication, because we require it, and it returns temporary creds. That client can take on the Deployer role.
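
Here is a rough sketch of that mediation flow using the Vault Terraform provider. The mount paths, directory attributes, group names, and IAM role ARN are hypothetical, and the Duo MFA enforcement mentioned above is omitted: LDAP group membership maps to a Vault policy, and that policy lets the user pull temporary AWS credentials for the stack's Deployer role.

```hcl
variable "ldap_bindpass" {
  type      = string
  sensitive = true
}

# Authenticate users against Active Directory over LDAP.
resource "vault_ldap_auth_backend" "ad" {
  path      = "ldap"
  url       = "ldaps://ad.example.com"
  userdn    = "OU=Users,DC=example,DC=com"
  groupdn   = "OU=Groups,DC=example,DC=com"
  binddn    = "CN=vault-bind,OU=Service,DC=example,DC=com"
  bindpass  = var.ldap_bindpass
  userattr  = "sAMAccountName"
  groupattr = "cn"
}

# Membership in this AD group grants the stack's Deployer policy.
resource "vault_ldap_auth_backend_group" "billing_deployers" {
  backend   = vault_ldap_auth_backend.ad.path
  groupname = "billing-p-001-deployer"
  policies  = ["billing-p-001-deployer"]
}

# The AWS secrets engine hands back temporary credentials by assuming the
# stack's Deployer IAM role; no long-lived keys reach the user's laptop.
resource "vault_aws_secret_backend" "aws" {
  path                      = "aws"
  region                    = "us-east-1"
  default_lease_ttl_seconds = 3600
}

resource "vault_aws_secret_backend_role" "billing_deployer" {
  backend         = vault_aws_secret_backend.aws.path
  name            = "billing-p-001-deployer"
  credential_type = "assumed_role"
  role_arns       = ["arn:aws:iam::123456789012:role/billing-p-001-deployer"]
}
```

A user in the right AD group would then authenticate (for example, `vault login -method=ldap`) and read short-lived credentials from the role's creds endpoint, so nothing long-lived ever lands on a laptop.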

By the way, it can work the same way from Jenkins, except Jenkins doesn't use MFA for a service account. The idea is we use the same temporary passwords, the same kind of thing everywhere so that, first of all, there's some uniformity ... or at least that's what we're aspiring to. There's uniformity in terms of the context you write your scripts for, and second of all, it gives us ability ... it's like the Terraform problem with content in your state file that you don't want. At least it expires, in terms of the password, so there are some saving graces there, but to me, it's all about not exposing the real keys anywhere. From, obviously, that Deployer role, which can run literally on my laptop or on a machine in Amazon, on a bastion host ... it then has that Deployer role; it creates the stacks, upgrades them, updates them, whatever. Each of these components is forced to be in the execution role.

This Deployer can deploy anything as long as it's in that execution role, in the specific VPC that execution role's allowed. If it tries to do anything else, it can't. That narrowness is really what enables us now to be able to bring in a developer, because if you remember, I said I have to have the developer community come into my production environment and be able to work and touch things. I now have a way to isolate them at large, systematically, because of the standardization we've put in here. We also take away a lot of problems. They don't have to worry about how to access their S3 buckets; we manage that. If they need to isolate things inside the buckets, there are specializations that they do have to do, and there are standard patterns for doing that. The key here is we've provided the context to make those kinds of things much, much simpler. Again, that's the standardization that I think is really valuable.
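
That narrowness can be sketched as an IAM policy fragment (account ID, VPC ID, and names are hypothetical; a real RunInstances policy needs additional statements for AMIs, volumes, and so on): the Deployer may pass only this stack's Executor role and may launch only into the stack's own VPC.

```hcl
data "aws_iam_policy_document" "deployer_narrow" {
  # The Deployer may hand instances only this stack's Executor role.
  statement {
    sid       = "PassOnlyExecutorRole"
    actions   = ["iam:PassRole"]
    resources = ["arn:aws:iam::123456789012:role/billing-p-001-executor"]
  }

  # Launches are confined to the stack's VPC via the subnets being used.
  statement {
    sid       = "RunOnlyInStackVpc"
    actions   = ["ec2:RunInstances"]
    resources = ["arn:aws:ec2:us-east-1:123456789012:subnet/*"]

    condition {
      test     = "ArnEquals"
      variable = "ec2:Vpc"
      values   = ["arn:aws:ec2:us-east-1:123456789012:vpc/vpc-0abc1234"]
    }
  }
}
```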

I'm wrapping up now. If I look at it, it's about workspaces. The fact that I have my developers each in a different workspace. I have them working together, but I have that, shall I say, shield around each one. I can populate this P, by the way, in this naming convention; that stands for production. Guess what? We have an S. We have a C. We don't have a D for dev. C stands for corporate, because this doesn't actually stand for how it's used. It actually stands for the security domain you live in. We have completely air-gapped environments, and the idea is I move these things, literally, from air-gapped environment to air-gapped environment, so that all the dependencies are pulled from our Artifactory in that environment.

In other words, I'm building these ... this is replicated three times in our company so that I can guarantee the repeatability and the ephemeral nature of every app, as long as I can hold the data and manage the data separately. Which, by the way, is its own challenge, but I'm able to meet a whole bunch of my targets. What's cool about this is, yeah, I've figured out a bunch of decoupling of tools. Let's say I have one of these tools fail; the good news is, the coupling between the tools isn't strong. My Jenkins, which by the way is not my tool of choice for CD, maybe that can go away here. I can use CodeDeploy, I can trade things off, but I can trade things off without having that conflation where one technology is so intertwined with all the others, like Jenkins is totally intertwined with Git. I can move something around. That's my story. Any questions? Anything I can clarify?

Yes?

No, it's ... well, the Owner role is like a system administrator for that one app, okay? So they have a scoped role. On the other hand, though, a general system administrator may not have the same access to moving, for example, backed-up datasets around that most system admins would, because that would mean it got pushed off into some other quadrant of the universe, but the Owner still has access to it ... whereas as the sysadmin of the main account, you won't. These are actually ... they are not strictly hierarchical. When you go down deeper, you gain more rights for a narrower set of resources.

Any more questions? Sure!

Okay, yes. Do I have any horror stories of things going over the air gap and falling apart? Actually, not yet. That's because we are doing it for a subset of applications. I know the sneakernet approach that everybody uses: "Hey, I'm throwing stuff in a semi-public S3 bucket and I'm sucking it up over there," and other magic like that, that you didn't hear from me. Seriously, there are a lot of ways around it that we see being done. They are common practices that most people don't even admit occur. The fact is that, for our SOC 2 applications, we are pushing this and we've had to go through some headaches to get there, to make them fit into that norm. I'd say the largest headaches we've had are because we're using Artifactory; we actually have to do a copy-to-copy. We can't just change an administrative pointer. That can take time, and we have various policies: some things are automatic, sometimes I have to have a human actually okay it. I would say that's probably one of the bigger headaches.

Well, thank you very much. Appreciate your attention. Have a good night.
