Presentation

The Ecological Impact of Compute

In this presentation from HashiConf 2017, Seth shows how cloud is doing its part to reduce the carbon footprint of the digital economy. He also shows how HashiCorp Vault and Nomad can help.

The environment is critically important to our jobs and to life itself.

Data centers, meanwhile, are the backbone of the modern economy. Without data centers, the world would be almost unrecognizable. Yet in 2017 they used about 91 TWh of energy, emitting 100 million tons of CO₂e, the same amount as 26.5 million cars.

But it’s not IaaS that’s the problem: On average, private data centers are about 10 times more polluting than public cloud services [source: NRDC]. Cloud is actually doing its part to reduce the carbon footprint of the digital economy.


Speaker

Seth Vargo

Transcript

I want to talk about the environment today. When I talk about the environment, I'm not talking about production or staging; I'm talking about trees. I want to talk about an environment that has been around a lot longer than some of your production environments, and that's our planet. I'm challenging everyone here to ask ourselves: how does the work that we do impact the world around us, from an environmental standpoint?

First, I have a few disclaimers. I am not an environmental activist. The purpose of this talk is not to convince you that climate change is real or any of that. All of the data I'm about to present to you is fully researched. It comes from the Natural Resources Defense Council, and you can look it up on your own. I'm also not the epitome of environmental protection. I drive a car that allegedly gets 19 miles per gallon. It is fully gasoline powered; it is not a hybrid. I'm telling you this so that you don't think I'm preaching to you. I'm giving you statistics, and you'll see why in a minute. I'm not super environmentally conscious either. I keep my A/C at 70 degrees all the time. It's pretty comfortable.

The reason I'm giving this talk is this: just by a show of hands, how many people have been directly or indirectly, through family, friends, and loved ones, affected by Hurricanes Harvey, Irma, and Maria, the earthquakes in Mexico, or the wildfires in the Northwest? How many people have been directly or indirectly affected? About half of the room. The environment is something that is incredibly important to us. Without it we literally can't do our jobs. I want to talk about that a little bit.

First, some background. Behind me is a 3D graph of all of the internet traffic in the world in 2015. This screen is 50 feet wide; if anyone wants to see it after the conference, I'm happy to show you the full graph. It extends about four times above this and four times below this, so it's a massive graph. It very closely resembles the structure of the human brain, and that's fitting: just as the spinal cord and the brain are the data center of the human body, data centers are the backbone of the modern economy. They literally run everything. Your phone, your laptop, your job, everything runs through a computer. Even this talk I'm giving right now is being recorded by equipment with a little microprocessor controlling all of it.

Data centers are really the backbone of the modern economy. So much so that every 60 seconds we have 204 million e-mail messages, 5 million Google searches, 1.8 million Facebook likes, 350,000 tweets, $272,000 in Amazon purchases, and 15,000 music downloads. People still do download music; that was weird to me. Each and every one of these requests needs to flow in and out of a data center, and those data centers require power and energy in order to fulfill those requests.

What's even worse is that this is the most recent data, and it comes from 2014. If we apply Moore's Law, we can reasonably say that these numbers have more than tripled by 2017, so we are most likely seeing over 600 million e-mail messages sent every minute. All of this accounts for roughly 91 billion kilowatt-hours of power. I'll let that number sink in for a second: 91 billion kilowatt-hours. That's roughly the annual output of 34 large 500-megawatt power plants.
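As a rough sanity check on that figure (the 60% capacity factor below is an assumption for illustration, not a number from the talk):

```latex
% 91 billion kWh spread over the 8,760 hours in a year:
\[
\frac{91 \times 10^{9}\ \text{kWh/yr}}{8{,}760\ \text{h/yr}} \approx 10.4\ \text{GW}
\]
% 34 plants of 500 MW each, assuming a ~60% capacity factor:
\[
34 \times 500\ \text{MW} \times 0.6 \approx 10.2\ \text{GW}
\]
```

The two numbers line up, so 34 large 500-megawatt plants is a fair characterization.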

To put that in a little bit of context, if the internet were a country, its total power consumption would sit somewhere between Italy and Spain. The internet consumes slightly less electricity than Italy and slightly more than Spain. Think about that for a second.

It accounts for 100 million tons of carbon pollution on an annual basis, which is roughly the equivalent of 26.5 million gasoline automobiles, and it would require 470 million planted trees to offset those carbon emissions. Almost half a billion trees, to offset the carbon emissions from the e-mail you send every day.

Let's think about this for a second. We see this increase in public cloud, and we see this increase in carbon footprint, so one might think, "Oh, well obviously this move to cloud is causing this increase in carbon pollution. Everyone can just get pushed to Heroku and have their app running; we don't care about performance." This is what we call inferred statistics. As we know, the number of films that Nicolas Cage appeared in correlates with the number of people who drowned in swimming pools, but one does not cause the other. As the chart behind me shows, we can sometimes draw the wrong conclusions as humans. We look for patterns in data, and there's an obvious pattern in this data. But it's not true, and that's because correlation and causation are very different.

The NRDC, the Natural Resources Defense Council, has identified four types of data centers.

The first is this thing called the Wholesale Co-Lo. This is where you run your entire data center. You own the brick and mortar on the outside, and the racks and the servers and the power and cooling on the inside.

The second type of data center is the Retail Co-Lo. This is where someone owns the brick and mortar, but you're buying pieces of the rack or the grid and someone else is paying for the power and cooling.

The third is managed hosting, or some type of hybrid retail and managed hosting, where someone else is managing the VPS for you. This is where companies like Rackspace really shone: they managed the servers and just gave you the VMs, but you had SSH access and all of that.

Then we have what the NRDC refers to as hyperscale computing, but what they're really referring to is a cloud provider, where someone owns all of the compute, all of the storage, all of the hardware, all of the cooling, and provides resources to you as a service, as part of a quota. You can have 10 CPUs, you can have 1 VM.

Which of these accounts for the most energy consumption once we adjust for size and quantity? Well, it turns out that cloud providers only account for about 5% of total energy usage across all data centers, while wholesale co-los, people who are running their own brick-and-mortar data centers, account for almost 50% of all data center power consumption. To phrase that another way, if we were able to get everyone running in their own data center onto a cloud provider, we would cut data center energy usage by almost 50%, because, as we saw, cloud is roughly ten times more efficient for the same work. Clearly, cloud providers are not the problem. Our earlier idea, that this rise of cloud computing, this influx of services like Amazon and Google and Azure, and more recently IBM and Oracle joining the mix, is causing the increase, is wrong. They are doing more to solve the problem than your own wholesale co-lo data center.

What causes this inefficiency? Well, the first cause is over-provisioning of IT resources, or as I like to call it, peak provisioning. Here's why: how many people here work in retail, where you're affected by Black Friday? Okay, a decent number of you. You probably do something like this. In the run-up to Black Friday you buy some new server capacity for your data center. You fill out a purchase order, you order a bunch of servers, you install them, and they start consuming power. A server consumes about 40% of its peak power at zero utilization, so just by plugging it in, you're at slightly less than half of its power consumption. You lose a little bit of money up until Black Friday, right? You're paying that energy bill, it's costing you a little more money, but you need that capacity, because come Black Friday you can't go down. You need to make those sales; that's key to your business. You make a bunch of money on Black Friday. Yay, you're rolling in the dough. Then you never take those servers down. You continue to run at this over-provisioned capacity, and you continue wasting power and under-utilizing those machines.

That leads to the second cause of inefficiency, which is invalid IT procurement procedures. IT organizations generally purchase in bulk, rack in bulk, and network in bulk, because it is inefficient to do otherwise. If you just buy one server, you don't get a discount. If you have to take someone away from their job to drive 50 miles to the data center to install one server, it's not cost effective to the business, at least in the short term.

This is how IT procurement decisions should be made: we should be looking at the long-term gains or losses on these machines, at the tax write-offs for capital expenses, at the short term and the long term, and at the human intervention cost. But in reality, this is how IT decisions are made: they're made by marketing, they're made by sales, or they're made by lies. Notice the asterisk up there. Right?

Number three: unused "zombie" servers. One of the byproducts of that over-provisioning is a collection of servers that run at very minimal, if not zero, capacity. Sometimes these servers are running no applications at all and are just consuming power.

The NRDC has this really, really good statistic. I'm not a person who puts a lot of words on a slide, but it's so good I want to read it to you: "An estimated 20-30% of servers in wholesale data centers are idle, obsolete, or unused but are still plugged in and consuming energy doing nothing." Up to a third of the servers in wholesale data centers are sitting there doing nothing.

Well, the solution should be obvious, right?

Speaker 2: Just unplug the goddamn thing.

Seth Vargo: They go on to say the most interesting paragraph I have ever read in my life: IT managers cannot identify the owners of about a third of those servers, so they are reluctant to decommission that equipment because they don't know the business impact. They do not know what will happen if they unplug that machine.

This was such a big deal that the NRDC, along with a couple of other environmentally focused nonprofits, funded a competition in 2012 known as the Server Roundup. You're thinking this is a joke; it's really not. During that competition, companies were pitted against each other to see who could decommission the largest number of unused servers, increase utilization, and decrease power consumption, and there was a monetary prize for the top winners.

The winner of that competition decommissioned almost 10,000 physical servers, reducing their total environmental impact by five megawatts of IT load directly, which corresponded to another four megawatts of associated cooling, so just under 10 megawatts of power. That resulted in five million dollars in annual energy savings for that company. Now, this was 2012, so think about which companies ran wholesale data centers in 2012. Anyone want to take a guess which company this is? It was AOL. When asked why, why didn't you turn these off before, why didn't you unplug them before, we're talking about millions of dollars saved and a better environment, they summed it up in three things. One, there was a misalignment of responsibilities. Two, there were no incentives to reduce load; they needed a monetary incentive to reduce load. You might think, wouldn't saving five million dollars be a monetary incentive? That's the third and biggest point: electric costs are often paid for and budgeted by a different department. The operations engineers who were racking servers were not incentivized to reduce load, because their budget did not account for electricity; someone else paid for that.

Number four is interesting: it's the lack of a standardized utilization metric. What is utilization? 80% CPU? Should you run at 100% CPU? Should you overclock your CPU? What about memory: is 60% memory good utilization? What about 80% memory? What about things like disk usage and network bandwidth, where you talk about IOPS: is 1,500 good? I don't know. Then some people come up with their own metric, which is really some convoluted "add this, divide by that, is it a Tuesday on the third equinox" kind of utilization, right? We're not speaking the same language. When I say I'm getting 100% utilization or 80% utilization, that might mean something different to you than it does to me. We need a standard metric to talk about utilization.
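As one concrete illustration of a shared vocabulary (a sketch, not the only approach): a scheduler such as HashiCorp Nomad, which comes up later in this talk, makes every job declare its needs in fixed units, MHz of CPU and MB of memory, so utilization numbers mean the same thing on every node. The job name, image, and values below are hypothetical:

```hcl
# Hypothetical Nomad job: resource needs are declared in standard
# units, so the cluster reports utilization in one shared vocabulary.
job "web" {
  datacenters = ["dc1"]

  group "frontend" {
    task "server" {
      driver = "docker"

      config {
        image = "nginx:1.13" # illustrative image and version
      }

      # CPU in MHz, memory in MB: a "standard metric" in practice.
      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
```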

The fifth one is competing priorities for efficiency. When efficiency is pitted against things like availability, deployments, reliability, and security, it tends to take a back seat. This is really how most organizations look at efficiency, right? It's that 0.05% slice. It's in the backlog's backlog. It is only at this point that we can finally understand how Hawaii feels on a map of the United States. In all seriousness, efficiency is not a top priority for people running wholesale data centers. However, if you look at any major cloud provider, you'll notice they publish statistics; they have entire departments dedicated to these statistics and to protecting the environment. Just like when Apple does a keynote and says their phone is BPA-free, if you watch re:Invent or Google Next, they're touting the same kinds of statistics about their data centers and efficiency. It's something they care about, and because they're running at such a large scale, they can invest the resources to make it happen.

The easiest way to reduce your environmental impact is to move to a cloud. Cloud providers consume significantly less energy, roughly a tenth, to provide the same performance, and they're conscious about it. Hyperscale cloud computing, as the NRDC calls it, consumes significantly fewer resources for the same investment. The obvious solution is to move to the cloud, but first we have to debunk some myths about why the cloud is a bad thing.

Myth number one: "We don't get the same performance on the cloud." This might be true, especially if you're performing a lift-and-shift operation, but in reality what you're saying is that you probably don't understand your application's requirements. If you take an application that's currently running with 10 gigs of RAM on a dedicated bare-metal machine and you put it on a t2.micro, you're not going to get the same performance. You need to understand your applications.

The second myth is that the cloud is insecure: "We have to be HIPAA/PCI/FIPS compliant." This one always boggles my mind, especially since we made a security tool at HashiCorp called Vault. These cloud providers hire the most experienced, read that as expensive, professionals on these topics. Industry leaders on these topics. So I ask this question: are you actually better off if compliance is someone else's part-time job? Probably not. The cloud providers have dedicated teams, and if you are in healthcare or finance they will work with you to get you on dedicated VMs, to get you those checkboxes on those audit reports.
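To make the Vault mention concrete: access in Vault is governed by policies written in HCL. Here is a minimal sketch, with hypothetical paths, granting an application read-only access to its own secrets and nothing else:

```hcl
# Hypothetical Vault policy: the app may read and list its own secrets.
path "secret/myapp/*" {
  capabilities = ["read", "list"]
}

# Explicitly deny a more sensitive mount; in Vault, an explicit deny
# takes precedence over grants from any other attached policy.
path "secret/finance/*" {
  capabilities = ["deny"]
}
```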

The third myth: the cloud is too expensive. This is my favorite one. For many use cases it's probably cheaper and more flexible, but this goes back to the first myth, "we don't get the same performance." If we don't understand our applications, the cloud can be more expensive. If you just lift and shift, it's going to be more expensive. But if you lift and shift as a temporary solution, then start looking at monitoring and metrics to see where you can reduce instance sizes, and where you can leverage cloud-specific technologies like Lambda functions from AWS or Cloud Functions from Google for very quick one-off tasks, you can significantly reduce your load. Yesterday we heard a talk from Calvin from Segment, who talked about how they were able to save over a million dollars on their AWS bill using Terraform and optimizing with some of these cloud providers' specific services.

Maybe you're thinking, "We run our own bare metal and achieve 80% utilization using a scheduler, and we routinely audit server usage using an inventory management system." Then this talk is not for you, but you are in the minority's minority. So, congratulations, there's a clap emoji for you. You are in the minority; the data shows that you are in the minority.

Then the last myth is, "I am fine because I'm using a cloud provider. I'm on the cloud, so this talk isn't relevant to me." You might be imposing less of an environmental impact than someone who's running their own data center, but you still might not be using your resources to their best capacity.

That brings me to my last point, which is maximizing resource utilization. Humans maximize things like this all the time. We are programmed to maximize our time with calendars, right? You have your smartwatches, your phones, your calendar alerts. Some people are actually so busy that they have human schedulers called executive assistants, who are required to evaluate a series of constraints when scheduling activities: we have a meeting at this time, we have to do lunch at this time, this person can't meet here, this person has a commute. These are all constraints that humans solve every day. A scheduler is simply a person or thing that organizes or maintains schedules; it's a very simple concept. If you put a calendar invite on your phone, you're a scheduler in the very literal sense.

But we also have computer schedulers. Computer schedulers are very similar to human schedulers in that they map a set of work onto a set of resources. Just like humans map a set of meetings onto a calendar, computer schedulers map work onto resources; a computer scheduler is really just a more generic version of a human scheduler. Schedulers aren't a new concept. They've been around for a very long time. Take paper calendars: I remember when I was growing up, my mom had a calendar magnetized to the refrigerator, and it had the soccer schedule and everything on it, and that was where we went for information. When things changed, we had to move things and cross stuff out and white things out and draw arrows, and that was a scheduler. Now we have technology like Excel and calendars like iCal and Google Calendar that make this a lot easier for us and let us invite other people and collaborate. But schedulers are not a new concept.

I want to talk a little bit about what schedulers do for us. If I'm an operator and I run my own data center or I work in the cloud, it traditionally goes like this without a scheduler. We have our series of machines and we give them names. We have to give them names, because how else will we know what they are? In this case we have Skywalker, Vader, Leia, and Solo; there's a theme here. And we put some apps on them, right? Someone goes to the data center: they drive in, they pull up their console, they either log directly in, or maybe they SSH in if you're really hip. They drop off the zip file or the tarball and run it, they look at some logs, they're good, and they drive home for the day. Then we need to put some more apps on, so we repeat that process. And every time we want to change something, we might be able to do remote management, or we might have to drive to the data center; it really depends.

If you think this is super farfetched, it's really not. I worked for a company in 2014 that did this. We drove to a physical data center to do these things. They were using config management and everything, but we still had to go to the data center for these tasks.

Then you put these VMs or physical hosts in a table, usually an Excel doc or a Google Doc, and you say: this is the name, this is its IP address, this is its MAC address, and then there's the Notes column. Everyone has a Notes column, because where else are you going to put notes?

Then inevitably one of the servers dies. Vader's just always messing up. One of those servers dies, you get paged, and someone has to drive out to the data center if they can't manage it remotely, especially if, I don't know, someone ran over the power cable with their chair and someone has to go fix it. You got paged, someone went out, and we had to move the applications off. Maybe you could just plug the server back in and it would come back online, or maybe you had to redistribute those applications to other machines because you were going to have to completely reimage the whole box. You can't use it anymore; it's completely fried.

Let's say we do that: we reimage that machine, we put out the fire. Well, we have to clear out the table, because the machine is going to get a new IP and probably a new MAC address when we reprovision it. We reprovision the server, we record its new IP and its new MAC address, and we make sure we put in the Notes column that it got rebuilt, because that's important. Maybe, maybe we'll reschedule those applications onto it, but most likely we'll leave it sitting there, plugged in, until another one of those servers catches fire and we need the extra capacity to move over.

It turns out that this can become "highly available" if you give those operators these things called pagers. This obviously doesn't scale. If you have to drive 50 miles every time an application goes down, or just to swap out a server, it's not going to work, especially when you think about the size of some of these organizations. Modern schedulers codify and automate this process. That thing I just showed you, moving an application onto another machine because a host has died, happens in an entirely automated fashion. We don't have to wake someone up in the middle of the night. You tell your scheduling software, "I want 10 copies of this app," and the scheduler decides which hosts to put them on. We can add constraints; we can say "never run this in Europe" or "always run this on this pool of VMs." And we don't have to wake someone up in the middle of the night if one of those VMs dies; we can just re-allocate that work somewhere else. Maybe we send an e-mail or some kind of notification so someone can deal with it in the morning, but we don't have to wake someone up in the middle of the night.
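In Nomad's job specification, for example, that request, 10 copies, never in Europe, always on a designated pool of VMs, might look something like this sketch (the datacenter names, node class, and binary path are hypothetical):

```hcl
job "app" {
  # Only non-European datacenters are listed, so the job never runs
  # in Europe.
  datacenters = ["us-east-1", "us-west-1"]

  group "web" {
    count = 10 # "I want 10 copies of this app"

    # Pin this group to a designated pool of VMs.
    constraint {
      attribute = "${node.class}"
      value     = "web-pool" # hypothetical node class
    }

    task "server" {
      driver = "exec"

      config {
        command = "/usr/local/bin/myapp" # hypothetical binary
      }

      resources {
        cpu    = 200 # MHz
        memory = 128 # MB
      }
    }
  }
}
```

If a node dies, the scheduler notices the lost copies and places them on healthy nodes automatically; nobody gets paged at 3 a.m.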

So, schedulers map work to resources. They're not a new concept. CPU schedulers map threads to physical cores on your machine. EC2 and OpenStack Nova map VMs onto hypervisors. Hadoop's YARN maps MapReduce jobs onto nodes or worker pools. What I'm talking about here, the thing I just described, is what we like to call a cluster scheduler, which maps a series of applications onto VMs or hosts or machines.

Again, cluster schedulers are not a new concept. Google has Borg, AWS runs a scheduler under EC2, Twitter was running Aurora for a while, and Netflix has Titan. These schedulers have been around for a very, very long time, and they've mostly been run at very large scale, because for a while they were only needed at very large scale; they only solved problems that exist at very large scale.

Now, as we move toward cloud computing and highly ephemeral environments where we're trying to maximize resource utilization, the user experience and the tooling surrounding these schedulers have gotten significantly better. It's not out of the question for a small startup to consider something like Nomad or Kubernetes to run their applications. Now, you should probably not put your personal blog on Kubernetes; if it gets five users a year, you're overdoing it. But for any application where someone would get paged in the middle of the night, it's not out of the question to consider a scheduler, especially if you're trying to be environmentally conscious, or financially conscious, and maximize your resource utilization.

So, what do schedulers get us? Higher resource utilization, an abstraction that decouples work from resources, and ultimately a better quality of service, which I like to call quality of life as well. On higher resource utilization: many schedulers offer bin-packing algorithms, and they can provide over-subscription and job queuing. Many schedulers also do more than just long-running services. Perhaps you need to send out 10,000 e-mails reminding people of something: a scheduler can do that for you. And if you do have a long-running service, a scheduler can handle that too.
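The 10,000-e-mails example maps naturally onto a batch job, which a scheduler runs to completion and then frees the capacity for other work. A sketch in Nomad's job format, with a hypothetical script path:

```hcl
# Hypothetical one-shot batch job: it runs to completion, and then
# the resources it used are returned to the cluster.
job "send-reminders" {
  datacenters = ["dc1"]
  type        = "batch" # not a long-running service

  group "mailer" {
    task "send" {
      driver = "exec"

      config {
        command = "/usr/local/bin/send-reminders" # hypothetical script
      }

      resources {
        cpu    = 100 # MHz
        memory = 64  # MB
      }
    }
  }
}
```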

Decoupling the work from the resources is very important. It provides an abstraction with a series of contracts. Instead of writing code that directly integrates with one specific thing, so that you're tied to it forever, you work against a higher abstraction: you're not vendor locked-in. You can extend this further and say, "I don't want to be on Amazon, I don't want to be on Google, because I don't want vendor lock-in." Well, if you use a scheduler, you can actually bridge across all of the clouds and treat Amazon, Google, Azure, Oracle, and IBM as just pools of resources. You submit a job, and it runs in one of them or all of them. That provides standardization: you can treat all of those clouds the same, because they're presented to you the same way.
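Concretely, if a Nomad cluster had clients in several clouds, each labeled with an operator-assigned datacenter name (the labels below are hypothetical), a single job could span all of them:

```hcl
job "portable-app" {
  # One job, three clouds: the scheduler places work wherever capacity
  # exists. These datacenter names are hypothetical operator labels.
  datacenters = ["aws-us-east-1", "gcp-us-central1", "azure-east-us"]

  group "app" {
    count = 5

    task "server" {
      driver = "docker"

      config {
        image = "example/app:1.0" # hypothetical image
      }

      resources {
        cpu    = 250 # MHz
        memory = 256 # MB
      }
    }
  }
}
```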

Last is quality of service. We can do things like priorities: Job A is always more important than Job B. We provide resource isolation, whether that's using cgroups or containers, and preemption. Ultimately, all of these things provide a better quality of life.
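As a sketch of those quality-of-service knobs in a Nomad job (whether lower-priority work is actually preempted depends on the version and configuration; the image and values here are hypothetical):

```hcl
job "job-a" {
  datacenters = ["dc1"]

  # Priorities range from 1 to 100; when capacity is contended,
  # higher-priority jobs are scheduled ahead of lower-priority ones.
  priority = 80

  group "critical" {
    task "server" {
      driver = "docker"

      config {
        image = "example/critical-service:2.1" # hypothetical image
      }

      # The resources stanza doubles as an isolation boundary, enforced
      # via cgroups or the container runtime, depending on the driver.
      resources {
        cpu    = 1000 # MHz
        memory = 512  # MB
      }
    }
  }
}
```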

In conclusion: if you're trying to do your part to help the environment, or this is simply something you're interested in, you should consider a cloud, or multiple clouds. You should definitely investigate a scheduler, no matter your size. And most importantly, you should measure everything, so you can sleep better at night, not only knowing that your pager won't go off because a scheduler is doing the work for you, but also knowing that you're doing your part to protect the environment. Thank you.
