Everything as Code: The future of ops tools

How do you manage containers, VMs, bare metal, and cloud services in a modernizing software business or enterprise? Seth Vargo, a Google developer advocate, has the answers for reducing the insanity and finding the right tools to manage operations with code.

Today you can fire off an API call and have more compute power at your disposal than existed just a year ago. But to take advantage of this power, you need automation, and you need to be able to manage your infrastructure programmatically—at every layer.

You need powerful operations tools. You need to write code.

Years ago, we used to treat servers as pets (we'd even name them!). But now, the most competitive software firms are operating dynamic, hybrid infrastructures filled with containers, VMs, bare metal, and containers on VMs on bare metal that they might not even own. The proliferation of X-as-a-service (-aaS) systems (e.g., CDN, DNS, database, monitoring, security, and object storage cloud services) increases the complexity of this situation even further. The only agile way to manage this type of infrastructure is by treating servers as cattle, not pets—and we need programming code to act as the herder.

Seth Vargo, a developer advocate at Google, provides a map of the modern tooling ecosystem that operations and DevOps engineers need to understand. The categories include:

  • Configuration management — Chef, Puppet, Ansible, SaltStack
  • Containers — Docker or other Open Container Initiative (OCI) compliant runtimes
  • Infrastructure as Code — Terraform
  • Continuous Integration (CI)/Continuous Delivery (CD) — Jenkins and the Jenkinsfile, Travis CI and the .travis.yml file
  • Managing security and policy at scale — Vault
  • Container orchestrators — Kubernetes, Nomad

Vargo also lists the advantages that programming code gives you when managing operational complexity:

  • Linting, static analysis, and alerting
  • Testing
  • Collaboration (a common language)
  • Separation of concerns
  • Model abstract concepts
  • Theft (stealing ideas from the application development workflow)

And what is the future of infrastructure as code, if we keep following the innovations of software developers over the last decade?

  • Less operator intervention
  • Deeper automatic scaling insights
  • Automatic security scanning and pen testing
  • Automatic security patching
  • AI & machine learning in operations (log analysis and anomaly detection)
  • Machine learning-based distributed tracing
  • Serverless / Functions

Vargo believes that another major paradigm shift will happen in operations when organizations move from a culture of fighting fires to a culture of starting fires.

Transcript

It's the final talk of the conference. Who's excited?

Awesome. I am the only thing that stands between you and alcohol. I recognize that, so, I will go as slow as possible. I'm just kidding. Today I'd like to talk to you about Everything as Code. And for those that have ever seen me give a presentation before, you know that I like to do live demos. So, I'm just going to set expectations from the beginning that there are no live demos. I'm sorry, but there aren't any. But that's okay because the content is still really interesting. Thank you, Nick, for the introduction. For those of you who don't know me, my name is Seth. I'm a developer advocate for Google. Prior to that I worked at HashiCorp.

Many of you may recognize me. I wrote tools like Consul Template and a bunch of other things. But today I work at Google, primarily focused on the HashiCorp tools and the DevOps ecosystem, to make sure that GCP (Google Cloud Platform) is a really excellent place to use these tools. But I don't really want to talk about that today.

Operations evolution from 10-15 years ago to today

What I want to talk about is code, and particularly everything as code, and what's next as code. For a little bit of background, if we were to go 10 or 15 years into the past, operations was really easy. In absolute terms it was hard, but relatively speaking, it was really easy.

We had one server and one data center and we called that server a mainframe. And it was very large and it required a team of people, but those teams were actually managing the server in a very mechanical layer. They were fixing the glass tubes that would explode and rewiring things. These servers were huge and they had punch cards that you programmed on them, but they were relatively easy to maintain.

Then we take a step forward and our data center has moved from mainframes to servers. So, we go from having this huge machine that takes up an entire room to blade servers that we can put in a rack and we can network them together. This was still relatively easy to manage by hand. Most companies, even large Fortune 500 and Fortune 50 companies would only have 10, 15, maybe 100 of these servers. You could name them, you could treat them like friends, you could invite them to dinner. They were easy to maintain by hand, even with a relatively small number of people. Each of these servers generally had one purpose. They ran one COBOL application, one Java application, and they had decent CPU and RAM but they weren't being used to their full potential.

But then the world got a little bit bigger. A lot of things happened. We became global. The economy became global. The need to connect systems internationally became global. We had to deal with things like regulation, where the content that I serve to people in a country has to differ based on the laws of that region. Additionally, we want to deliver content closest to the request. If I'm based here in Europe I don't want my request to have to go to a data center in Australia and back again in order to receive a response. We want to deliver content and process content as close to the edge as possible.

So, we started building out multiple data centers, both for disaster recovery but also for the ability to deliver responses and process requests as close to the origin as possible. This was great but it was slowly becoming impossible to manage this by hand. Now I have to get on a plane to tend to my servers, or I have to have multiple employees in multiple different time zones to tend to these different data centers. Another thing we recognized is that we have these huge fleets of compute, and we would look at the dashboards and we would see they were using 10% of our CPU and 5% of our total memory capacity.

This is where hypervisors came along. So, instead of having one application on one server, now we're able to virtualize the operating system and virtualize the kernel, and we could provide an isolation layer that was never really possible before. So, we can isolate things in a secure way or in a more secure way, and we can basically have 15 or 16 different VMs running on one of these servers.

Then things get even more complex, because in this model we're running on shared hardware and we're relying on isolation of the operating system layer, and this definitely requires a team of people, it definitely requires some type of automation. We went from thousands of servers to tens of thousands or hundreds of thousands of VMs overnight. This is where tools like Chef, Puppet, Ansible, and Salt really come in to help manage the complexity of an individual system, of one VM, of one server. And tools like Terraform come in to help manage the complexity of provisioning all of this infrastructure and keeping it up and running.

The many challenges of modern operations

More recently we've entered into an era that looks something like this [5:16], which is this hybrid collection of containers, and VMs, and bare metal, and containers on VMs on bare metal that you may or may not own, and it's a really, really complex world.

It doesn't actually end here, because with the proliferation of containers and microservices, we also see what I like to call the proliferation of -aaSes. That's where you have your software as a service, your platform as a service, your database as a service, your object storage as a service, your DNS as a service, your CDN as a service. So, instead of you being a company that runs a globally distributed content network, you can offload that onto a service provider, like a cloud provider or someone like Akamai that provides a CDN as a service. But just because you offload that work doesn't mean that the configuration goes away. You still need to configure the caches, you still need to configure the content expiration.

We live in this amazing world where, if we go back in time 15 years, literally none of this compute was possible. In my pocket I have more compute than was ever available 15 years ago on something like a mainframe. But we also introduced so much complexity in this new kind of architecture. It's this complex mix of different tools and different technologies, trying to find a way to bridge the gap between legacy applications and cloud native applications, VMs and containers, on-prem and cloud. And we need a strategy for managing this complexity.

What's worse is that this diagram doesn't even include some of the most critical parts. This diagram doesn't include anything about security. It doesn't include anything about policy, or compliance, or regulation. There is no monitoring, logging, or alerting. These are all third-party subsystems that we have to maintain, we have to keep up and running. Who monitors the monitoring system? How do we integrate all of these components together in a global way?

The point here is that we live in a world that demands tooling to manage complexity. It is no longer an option. You have to have tooling and automation in order to manage this complexity. You will go insane by trying to manage this by hand.

Why did we make things more complex?

We have to ask ourselves: why? Why did we make things more complex? It seems like mainframes would have been easier. Yeah, they were slower, but they were easier. And that brings me to what I'd like to talk about, which is the APUD cycle. How many people are familiar with the APUD cycle? You shouldn't be, because I made it up. The APUD cycle is very straightforward. It corresponds to the four pillars of the evolution of infrastructure: Acquire, Provision, Update, and then Delete or Destroy.

Compare today's hybrid data center and multi-cloud world with the past. In the past, in order to acquire compute, we had to pick up the phone and call a vendor. That vendor would then process a purchase order, and then like six to nine weeks later we would get boxes that had cardboard packing in them, and we would have to unpack them and put a server in a rack, and screw it in, and connect data cables. We don't live in that world anymore. But that was a real world, and there are still people who work in data centers that are still doing that. They're unboxing Dell and IBM servers and putting them on a rack.

Then we had our data center operations team that would come in, and once that server was connected to the network, or the local network, or whatever you might call it, they had to provision it. They had to put the initial users on, put on the initial operating system, the initial software packages, etc. Then there was probably another team that would go along and manage that system over time. "Oh, OpenSSL is vulnerable for the fourth time this month. Got to update again." There is a team that's constantly managing that.

And then we have our data center operations team, which was responsible for decommissioning those servers. 15 years ago this was a very painful process. The vendor acquisition process would take weeks if not months to go through legal, and purchasing, and shipping to acquire new compute. The process of data center operations could take days or weeks depending on the backlog just to provision a new server. The updating process would also take hours or even days, depending on the backlog. And the decommissioning process or the destroy process would take days.

So, what changed? Well, probably the biggest thing that changed is the introduction of cloud. Cloud technologies and hybridization technologies have allowed us to start treating things like compute, networking, and storage as resources. We don't have to think about the underlying physical machines anymore. When we look at cloud providers, cloud providers helped shift the acquisition of compute, storage, and network from weeks or days to minutes and even seconds. I can fire off an API call right now and have more compute than ever existed a year ago, in a single API call. That's what's available to us today. Configuration management tools took the middle part there, the provisioning and the updating down from days to minutes or even seconds to update these systems and keep them updated.
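
To make the scale of that shift concrete, here is a rough sketch (the instance name and zone are hypothetical) of acquiring and decommissioning compute with single CLI calls on Google Cloud:

    # Acquire a VM in seconds instead of weeks.
    gcloud compute instances create demo-vm \
        --zone=us-central1-a \
        --machine-type=e2-medium

    # Decommission it just as quickly when it's no longer needed.
    gcloud compute instances delete demo-vm --zone=us-central1-a --quiet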

So, we have strategies for managing this complexity. They're kind of all over the place and there's a lot of crossover between all these disparate systems, but there's also a heck of a lot going on. [10:50] This is a word cloud of the 2017 top Hacker News posts filtered by IT. If you look very closely, the word "buzzword" is on that word cloud. But if we just take a look at some of these here, we have obviously Kubernetes, Terraform, Docker, Cloud, DevOps, Serverless. There are so many things going on, and there are so many tools and so many choices that sometimes you feel like your head is going to explode. Why do we feel like this? And more importantly, what was our original goal?

It seems like mainframes might have been better. Mainframes did not have all of these problems, yet we're in a world where we demand tooling. It's not optional. We have so much tooling and so much automation just to get through our daily lives. So, like I said before, we have to have a strategy for managing this complexity, and time and time again that strategy has been code.

Codifying your infrastructure

This comes in from the application world; we see it time and time again with tools in this ecosystem: code is always the natural strategy for managing complexity. What is codification? Well, the term "codify" is very straightforward. It means you're going to capture a process, routine, or algorithm in a textual format. It's basically "write something as text." That codification may be declarative or imperative, meaning it may describe the end result, or it might describe the series of steps to take. For example, a recipe to make a cake or a casserole is imperative. It says, "Do this, then that. Step one, step two, step three."

Some systems are declarative. They instead just say, "This is the desired state. I don't care how you get there." Terraform is a great example of this. In Terraform, everything is parallelized by default, so you don't actually have a lot of control over what happens before other resources unless you explicitly require those dependencies.
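
As a minimal sketch of that declarative style (the provider, resource names, and AMI ID here are all hypothetical), a Terraform configuration describes the end state, and ordering is only constrained where you explicitly require a dependency:

    # Declarative: describe the end state, not the steps to get there.
    resource "aws_s3_bucket" "assets" {
      bucket = "example-assets-bucket"  # hypothetical bucket name
    }

    resource "aws_instance" "web" {
      ami           = "ami-0123456789abcdef0"  # hypothetical AMI ID
      instance_type = "t3.micro"

      # Without this, Terraform parallelizes freely; depends_on
      # explicitly requires the bucket to exist first.
      depends_on = [aws_s3_bucket.assets]
    }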

Let's look at some existing ways in which we manage complexity with code.

Configuration Management

Configuration management is obviously one of the biggest ones. To manage the complexity of a single machine or operating system or VM, we have tools like config management—tools like Chef, Puppet, Ansible, and Salt that codify and automate a machine's definition. Chef recipes. Puppet modules. Ansible playbooks. Whatever Salt calls their things. Those are the codification of these systems. Then Chef, Puppet, Ansible and Salt are the technologies or the tools that enforce that codification. They might be executing a Python file or reading a Ruby script. It doesn't matter what the implementation is but they're the enforcement of that codification.
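
As a small sketch of what that codification looks like (the package is just illustrative), a Chef recipe declares the state a machine should converge to:

    # Codify part of a machine's definition: nginx installed,
    # enabled at boot, and running.
    package 'nginx'

    service 'nginx' do
      action [:enable, :start]
    end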

Containers

When we look at containers, things like Docker and OCI, we have a tool for managing the complexity of application requirements. My application needs this exact version of Python with this exact set of system dependencies. Containers are a great way to ship that application and all of its dependencies as one unit. The Dockerfile is the codification. That Dockerfile is the textual format which is capturing all of the application and its dependencies. Then Docker, or OCI, or whatever your container runtime is, is the automation that is enforcing that Dockerfile and it's building and running the application.
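
A bare-bones sketch of that codification (the application files here are hypothetical): the Dockerfile pins the exact runtime and dependencies as text:

    # Pin the exact language runtime the application needs.
    FROM python:3.11-slim

    # Capture the application and its dependencies in one image.
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY . .

    CMD ["python", "app.py"]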

Infrastructure as Code

In the infrastructure as code world, we have tools like Terraform which manage the complexity of infrastructure at scale. Terraform configurations are the codification. They're a single text file or multiple text files that describe infrastructure and the relationships between them. Then we have the tool, again Terraform, which is applying those configurations. It reads them and enforces them to bring about a desired result in the correct order.
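
For instance, in this hypothetical sketch, referencing one resource from another is how a Terraform configuration describes the relationship between them, and Terraform derives the correct creation order from that reference:

    resource "aws_vpc" "main" {
      cidr_block = "10.0.0.0/16"
    }

    # The reference to aws_vpc.main.id creates an implicit dependency,
    # so Terraform knows to create the VPC before the subnet.
    resource "aws_subnet" "web" {
      vpc_id     = aws_vpc.main.id
      cidr_block = "10.0.1.0/24"
    }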

These are the ones that we're probably most familiar with. But we're starting to see some emergence in other spaces.

Continuous Integration (CI)/Continuous Delivery (CD)

In the CI/CD world, for example, there's the Jenkinsfile and the .travis.yml file where now we're using code to describe our build systems. Previously, if you've ever used Jenkins in the past, you know that most frequently Jenkins is configured via the web UI. But now there's a desire to start configuring and capturing these configurations and the build steps and the build output as code, and versioning that with the application. These examples, the YAML file or the Jenkinsfile, those are the codification. Then the tool, Jenkins, Travis CI, CircleCI, that's the automation or the tooling that is driving the instantiation of that file.
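
As a sketch (the install and test commands are illustrative), a tiny .travis.yml captures the build steps as code and gets versioned with the application:

    # The build definition lives in the repository, not in a web UI.
    language: python
    python:
      - "3.9"
    install:
      - pip install -r requirements.txt
    script:
      - pytest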

Managing security and policy at scale

We also see security and compliance complexity with things like APIs and Policyfiles. As the surface area for microservices grows and gets larger and larger, securing it becomes difficult to reason about. We need a tool and a strategy for managing the complexity of security at scale. With Vault, for example, the policy configuration is the codification which describes how our services should be able to get credentials and secrets. Then Vault itself is the automation, the enforcement of that policy, of that code.
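
A minimal sketch of such a policy (the secret path is hypothetical), written in Vault's policy language; Vault enforces it on every request:

    # Grant read-only access to one application's secrets;
    # anything not explicitly granted is denied by default.
    path "secret/data/myapp/*" {
      capabilities = ["read"]
    }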

Container orchestrators

Then obviously we have container orchestrators like Kubernetes and Nomad where you have a YAML file or an HCL file that you submit and the orchestrator runs that and executes that in a manner such that the end result is an application or a service or a load balancer that is running.
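
For example, a bare-bones Nomad job sketch in HCL (the names and image are hypothetical); you submit this file, and the orchestrator makes the end result happen:

    job "web" {
      datacenters = ["dc1"]

      group "frontend" {
        task "server" {
          # Run a container; Nomad schedules it somewhere in the cluster.
          driver = "docker"

          config {
            image = "nginx:1.25"
          }
        }
      }
    }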

Advantages of using code to manage operational complexity

There clearly exists this incredibly well-defined pattern for using code. Applications use code all of the time, but we've seen it in config management. We see it as infrastructure as code. We see it everywhere, which begs the question: Why? Why is code such a valuable tool for us to manage complexity? It turns out that there are a few really important reasons.

Linting, static analysis, and alerting

The first is linting. Once something is captured as text, as code, we can enforce our own opinions on that code. We can do things like static analysis. We can do linting. We can do alerting. We can go as simple as a regular expression or as complex as machine learning. But that analysis of that code leads to linting, and that linting can enforce consistency. Especially in a large organization or a broad community, consistency is a key component for adoption. If you want a tool or a technology to succeed in its adoption, it has to be consistent. You can't have team A doing something some way and team B doing something another way and expect them to be able to collaborate effectively. You have differing opinions. This is where languages like Go have built-in formatters to help enforce that and alleviate those arguments.

Testing

Once you have linting, you can take that a step further and bring in testing. The moment you capture something as code, you have a significant ability to test those configurations, whether it's just a client-side test, which might be a lint where you exit 1 if that lint fails, or something more difficult and time-consuming. We might be able to spin up an entire copy of our production cluster in a different environment and a different region and run a bunch of tests against it. But because we've captured that as code, not only do we guarantee that we're getting the same result, but we can iterate on it and automate it over time.
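
As a rough sketch, the client-side version of this might be a couple of Terraform CLI checks in a CI script that fail the build when either check fails (the wiring into a particular CI system is left out):

    # Fail fast if the configuration is badly formatted or invalid.
    terraform fmt -check || exit 1
    terraform validate || exit 1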

Collaboration

Code also gives us collaboration, and this is a really key benefit of capturing something as code. Once we have a common format, once we're speaking the same language, we can do things like pull requests, change requests, and merge requests. We can automatically test integration with things like CI and CD, and build a workflow that works for collaboration.

Collaboration is a key piece of why we capture things as code. I wouldn't say it's the driving motivator. But when we talk about complexity and the reason why we're trying to manage these complex things, it's that it's difficult for one person to reason about these. Having collaboration, having checks and balances, having the ability for people to be on the same page and share the same ideas is a key reason why you might capture something as code and capture everything as code.

Separation of concerns

On the exact opposite point, we have separation of concerns. A lot of these tools provide some sort of modularization. Chef recipes. Puppet modules. Terraform modules. They provide an isolation layer where if we're in a large organization or trying to support a community, we can define our problem domain very specifically. We don't have to solve the world. We can solve our domain and we can say, "This is the set of problems we solve. Here is the module or the RubyGem or the Python package that does exactly what we say it does." It's your job to use that.

I call this the Lego principle. We can't all be experts of every domain. We can only be experts in red Legos with four dots. You pick your red Lego and you build the best red Lego that you can and you expect other people to build the Star Wars spaceship that's awesome and amazing or the roller coaster because you can't possibly reason about all of that complexity and be a domain expert in all of those fields.

Model abstract concepts

Code also allows us to do really cool things with modeling abstract concepts. When you take something and you capture it as code, there are a lot of third-party resources and tools that let us then visualize that code. Take a look at the Terraform tool, for example. You can graph all of the relationships between all of your different nodes and dependencies in Terraform, and this is a great way to visualize all of your infrastructure in one PDF or DOT file. You can actually take the output of that, which is DOT, an open format, and feed it into other tools.
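
Concretely, that workflow is just a couple of commands (assuming Graphviz is installed for the rendering step):

    # Emit the dependency graph in the open DOT format.
    terraform graph > graph.dot

    # Render it as a PDF with Graphviz, or feed the DOT file into other tools.
    dot -Tpdf graph.dot -o graph.pdf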

In the academic world, there are tools that accept DOT input and provide a 3D visualization world where you can actually explore your infrastructure in a 3D manner instead of a 2D manner. This is a really helpful way to model relationships and dependencies in this incredibly complex world. When it was a mainframe, everything was right here. But now we have things all over the world with different orders and different egress and ingress and we need a strategy to be able to think about that. Sometimes we struggle to just think about these things.

What's next for operations engineering?

Really, what am I saying? What I'm saying is that this entire talk is actually about theft. I am 100% encouraging larceny: not grand larceny, just mini larceny. What I mean by that is, if you take a look at what we're doing today in the infrastructure and operations space, it's everything that application developers have had for the past 5 to 15 years. CI/CD? Not a new concept for application developers. Code? That's application developers. Source control, pull requests, collaboration: these are all things that came out of the application developer workflow. This idea that we should be working together, this idea of breaking things down into microservices, these are all coming from the application development workflow.

We have to ask ourselves: If what we see happening now in the infrastructure world is what happened 10 to 15 years ago in the application development world, what does infrastructure look like in 5, 10 and 15 years? What is next? For the rest of this talk, I'm going to pose to you that based off of what we currently see in the application landscape, what is next for infrastructure? To date, to the best of my knowledge, none of this exists.

Less operator intervention

At a very high level, the number one thing we're going to see in the operation space is less operator intervention. There's a famous quote that's like "my job is to make myself obsolete." I think more and more as technology evolves, we're going to see less and less operator intervention. Operators are going to be creating fires, not putting them out. We look at companies like Netflix that have chaos engineering. I think they just renamed it. But this idea of instead of just constantly responding to fires, my system is so stable that I now inject fires in order to see how my system responds and is resilient.

Deeper automatic scaling insights

To a certain extent we have auto scaling. Auto scaling exists today, but not at a level where we have deep application insights into exactly what we need to scale. Do we need to scale CPU? Do we need to scale RAM? Do we have direct insights between our monitoring, logging, and alerting to know exactly what we have to scale, exactly when we have to scale it, and for how long, based off of historical data? If I'm an eCommerce site, can I preemptively auto scale ahead of the holiday season so that I can maintain capacity? I want to be proactive, not reactive. Right now, auto scaling is almost entirely reactive: based off of current ingress from my load balancer, kick off some type of auto scaling. I want to be proactive. I want to already have the capacity before the load hits. So that's what we're going to start to see coming in the next 5 to 10 years.

Automatic security scanning and pen testing

Another one is automated security scanning. You're like, "Wait, this exists today." Yes, automated security scanning exists today. Many cloud providers provide some level of automated security scanning for containers and applications. There are tools like Black Duck, which will do analysis of different software and software dependency packages, but they're not in a state yet where they know everything, and they primarily rely on CVEs, or reports from external people to say that software is vulnerable. They're not using fuzzing, they're not actively pen testing your applications. Instead, they're pulling from a CVE database, parsing your entire dependency tree, and saying, "Okay, 15 layers down the stack, there's a vulnerable version of a package, you should update it."

And you get an email and you wake up in the morning, and you're like, "Oh, I should patch all of my systems." That's really great, and it's going to get a lot better, because we're going to start seeing these security scanning tools really start integrating with things like fuzzing, where no longer are they just scanning a vulnerability database that a human put some type of CVE into; instead, they're actually penetration testing. It goes back to the chaos engineering. No longer are we just putting out fires; we're trying to cause them in a controlled environment before our users can, and before malicious harm can come to them.

Automatic security patching

So imagine a world where you wake up and you get an email from some automated system and it's like, "Hey, Seth. 50% of your applications have a vulnerable version of this package. You should patch them." That's a great world. Some of that exists today, but it's not in a mature state where we can actually rely on it with confidence. So, let's say we get to that world and I wake up. That's still an operator. That's still an operator who has to go SSH into systems and push Chef configs or Puppet configs everywhere. There's still a manual task that has to take place.

The next logical step is automated security patching, and this is not happening to the best of my knowledge today, and the reason I think this isn't happening is not due to lack of technology available, it's due to lack of willingness to change culture. How many people here would feel comfortable if tomorrow morning, you woke up from an automated email from your logging system or monitoring system that said, "Hey. There was a vulnerable version of a package. I patched it on 98% of your fleet, and the 2% that weren't patched, I removed those nodes from the load balancer, so they're not servicing requests. Let me know when you fix them."

How many people would be comfortable receiving an email like that? That's some Skynet shit right there. As an industry, some of us are like, "Yeah! All in! I'm so in on that." And then a bunch of people are like, "But job security." There's a cultural movement that has to happen where our job as systems engineers or systems operators or DevOps engineers, or whatever your business card says, has to shift from putting out fires to causing them. Only when that happens does something like automated security patching make sense. Only then will we be confident enough to say, "Oh yeah, that's great, and I'll deal with it in the morning." We have to be comfortable with these systems using automation.

AI & machine learning in operations

The last thing we're going to see is more intelligent insights, and this is probably the deepest thing now. If you were to look at, "What is the biggest thing right now in the developer landscape?" I think many of us would agree that it's AI and ML. If you're applying for VC funding, you have to put AI, ML, or blockchain somewhere in your pitch, or else you're not going to get funding. I refuse to talk about the blockchain in any public manner, so we're going to talk about AI and ML.

Artificial intelligence and machine learning are really, really hyped in the application developer landscape right now. I can, on my phone, take a picture of a receipt and it uses AI and ML to decipher all of the text and uses OCR to auto-submit that, so I don't have to type in numbers. That's a great world, but can we leverage that technology in the operations space? What does that look like? What does TensorFlow in operations look like? Instead of analyzing pictures to decide which one is the best of your cat or your dog, or analyzing photos to find "this is mom, this is dad, these are the kids," what if we're analyzing logs and metrics, and we're using machine learning and AI to do intelligent alerting based off of advanced heuristics that aren't available from some regular expression engine or the complex rule-based systems we use now?

There are certain things that humans are really good at, like empathy, and there are certain things that machines are really, really good at, like anomaly detection. There's a really fun video on YouTube. How many people have ever played those "Spot the Difference" picture games where they give you a side-by-side picture and you have to tap on the differences between them? On average, with TensorFlow, we can solve that in less than 100 milliseconds. But humans? The clock runs out, you lose every time. It's a game.

Machines and computers are really, really good at anomaly detection. So when you have 100,000 log lines that are spewing past you because you have that many microservices distributed across the world; how do you find that one log line that looks different? That one log line where, "Oh, there was a blip." Maybe that blip only happens every once in a while, but it led to a poor user experience. Or worse, that blip is actually an attacker or a hacker, or an intruder who is doing some type of malicious operation, but there is so much data; there is so much noise that you're never going to see it, and an attacker is relying on that fact. So being able to use AI and ML to ingest all of our logs and all of our data and actually point us to the things that matter; you might argue, "Well, you shouldn't be ingressing log data that doesn't matter." That's a whole different discussion, but we need anomaly detection. As our systems scale, as our services scale; as the amount of data and the number of users scale, we have to rely on technology to find the anomalies.

Machine learning-based distributed tracing

Next, we're looking at distributed tracing. We already have tools like OpenCensus that are really great at doing distributed tracing, but what about applying AI and ML to distributed tracing to detect rogue actors? What does a normal request path through my applications look like? Normally, requests hit microservice A, then microservice B, then microservice C, and then the database persistence layer. I learn on that data using machine learning and pipe the model into OpenCensus, or whatever I'm using for distributed tracing. If all of a sudden I see some traffic that's going from microservice A to D to Q, all the way back to the database, that's anomalous. That's something that I should alert on, because that could be a malicious actor in the system. It could be a bad deploy. It could be a misconfiguration. But these are ways in which we can use AI and machine learning, again, technologies that application developers are already leveraging today, to bring this to the infrastructure and the operations world.

Serverless / Functions

Lastly, everyone's favorite thing is functions or serverless. We're seeing a lot of adoption from application developers using serverless functions. I hate the word "serverless" (it's why it's in parentheses here), and I say functions instead, because it's not serverless, it's someone else's servers. I've seen them. You just don't have to manage them anymore, and that's okay, and there are a lot of benefits to using serverless. There are cost-saving benefits, there are time-saving benefits, but we introduce a lot of problems when leveraging serverless technologies as well. When that serverless function dies, it is gone, and you are now relying on historical logs and data to figure out what the heck happened inside of that function or inside that stored procedure.

Accepting risk

All of these tools and techniques and functions are really just a way to offload operations. It goes back to the crux of my talk, which is: We're trying to capture everything as code, and the reason that we're trying to capture everything as code is so that we can evolve to the next operator. What is systems administrator 2.0? What is DevOps engineer 3.0? It is only when we switch from putting out fires to causing them that that shift will happen.

In order to stop putting out fires, we have to embrace code. We have to embrace automation, and we have to embrace new technologies and an acceptable level of failure. One of my favorite analogies is when we talk about systems availability and someone says, "My system must be 100% available." There is no such thing as a 100% available system, because there inherently exists unreliability in the ecosystem between your user and your data center. If they are on any home network, whether it's fiber or cable or a dial-up line, there's an SLA that their connectivity provider is giving. Maybe it's three 9s or two 9s of availability. If you're on a mobile network, there's an SLA for the availability there. So even if your service is 100% reliable, your end user is never going to see that.

Between you and me, they're going to blame Verizon and Comcast far before they ever blame your application. So, accept risk. We have to be willing to accept risk. Right? Define what is an acceptable level of availability for your application so that you can be risky. You can automate things that have never been automated before. You can try new things, and if you go down for five seconds, you go down for five seconds. Unless you're in a world with mission-critical, time-sensitive data, that's okay. We have to be willing to accept that.

So, to conclude here today, what's next for operations is everything as code. I think you're all at this conference here today because one of the philosophies of the Tao of HashiCorp is: everything is code. It's a key pillar of all of the products that HashiCorp builds. You already believe in that, but some of your peers might not. So everything is code, everything has to be code, whether it's security, policy, automation, compliance, or infrastructure. It all has to be code for all of the benefits we talked about.

Once it's code, we can start moving from putting out fires to creating fires with less operator intervention, and once we're intervening less on the type-y type-y, we can do more on the strategic things; more intelligent insights. Our systems are only going to grow in complexity. Nothing has gotten simpler in the past 20 years. That was the whole point of that beginning. Mainframes were far simpler. We're only going to see increasing complexity, and we need to leverage new technologies and new techniques to manage that complexity.

And lastly, we're going to see a significant rise in serverless or functions, or this idea of, "Just run my code, and I don't care about the system that it's running on or the requirements of that system." Thank you very much for having me here today. I'll be around if there are any questions. Thank you for coming to the conference.
