
Online Experimentation with Converged, Immutable Infrastructure

Timothy Perrett’s fascinating talk at HashiConf 2017: Online Experimentation with Converged, Immutable Infrastructure.

Testing complex systems at scale is incredibly difficult. “The container works OK on my laptop,” isn’t really a useful test result.

A modern, complex system might be composed of hundreds of microservices and several databases. It might need to scale to inconceivable usage peaks. Also, there’s human error to consider.

In fact, an infrastructure can rapidly reach the point where it's effectively impossible to test new versions before putting them into production. Reproducing reported problems in a standalone test environment is also impossible.

So we might as well embrace reality and plan accordingly—with deployment testing.

But traditional “canary” deployment testing is far from an ideal solution, because the unit of work is too large. In other words, it lacks the granularity required to scientifically test a new deployment.

Enter online experimentation: choose a small segment of your total user population and use it to test the new version, further subdividing it into "control" and "treatment" groups, as if it were a drug trial. Use your edge servers to route traffic to the test groups or to the mass of "normal" users, giving you far more fine-grained control over the test.

Learn more in this fascinating talk, including how to use tools such as HashiCorp Nomad, Envoy, and Nelson, and how to cope with the challenges of testing complex systems at scale.


Transcript

Good to see everybody. I'm Tim, and today I'm going to be sharing with you some of my learnings over the last, I guess, 10 or 15 years of doing distributed systems: operating, designing, and building them, particularly within large enterprise companies.

So, the first thing that we're going to be talking about is converged and immutable infrastructure, just a quick recap for those people who may or may not be familiar. But for most of our time together I'm going to be talking about online experimentation: what it is, design trade-offs, different ways that we can make it work, operational complexity trade-offs, and so on. We've only got about 35 minutes together, so we're going to get started as we've got quite a bit to get through.

So, immutable infrastructure. What is it? It's the practice of replacing software components rather than upgrading them in place. Machines should not be long-lived; they should instead be dynamically provisioned from fixed images and destroyed as needed. Application systems are versioned, and they're replaced with newer versions as time passes. Operators, like the ops team, can move environmental conditions around the application, but the application itself should never change, or deviate, from the build-time artifact. Typically this is a container, or a virtual machine image in some environments.

So, converged infrastructure. For the purposes of this talk, what I specifically do not mean is the hyper-converged infrastructure that some storage vendors might be talking to you about at the moment. It's kind of buzzwordy, so I specifically do not mean that. Instead, I'd like to draw your attention to the high degree of underutilized resources that most data centers in industry carry today.

Traditional ideas around computing environments create physical barriers in our virtual world: firewalls, networks, racks, and so forth. So in this case there's a traditional stage or development environment, and there's a production environment, and the servers are literally separate. And while the stage machines may only ever receive a very small fraction of the traffic the production environment does, they cost the same. You pay the same; they're all the same. So whilst these partitions were typically added to provide security, or perceived security and perceived sandboxing, they actively prevent a high degree of resource utilization. This will forever couple us to a solution space that doesn't really take into account, in my opinion, how technology has evolved.

When resources are statically provisioned in this manner, it doesn't really matter whether you run on a public cloud like GCP or AWS, or whether you run in your own data center. You're either paying for VM time, or you're paying for electricity and depreciation on your capital investment. In most installations, dedicated machines only ever run a fractional load average. You're literally wasting money.

I don't really blame anyone for this situation, to be honest. Its evolution makes complete sense. If we consider for a moment the relationship between application developer and data center operator over the past decade or more, the product people have typically had to work directly with the operators, asking for machines, and the operators then go rack and stack and do whatever was needed to satisfy the request. Then they make a note of who the machines were provisioned for, and that's mostly that, until there's some kind of outage.

Now, the data center team don't want the product people poking around with SSH, and the product people, frankly, have better things to be doing. So this mode of operation doesn't really serve anybody particularly well. If you fast forward to the current day, as all of you know, you're here at HashiConf, we see this blossoming middle sector. This field of what I like to call infrastructure engineering has provided the tools that fuel the so-called DevOps revolution. It empowers both sides of that previous relationship: data center operators now focus on the hardware and keep the site running, whilst product engineering folks deliver business value, making use of those self-service tools. I think it's these very tools that allow us to rethink how we operate our systems. We no longer have to be beholden to those dedicated machines. Instead, we can see our fleet as one uniform pool of resources.

And if we imagine what that looks like, it kind of looks like this. All the yellow squares would be our machines, and we have a scheduling substrate. We're at HashiConf, and you've just heard a good talk by Alex about Nomad, so Nomad would be one example of a scheduling substrate. Mesos would be another; Kubernetes is another. And so on; there are many different products for these things. The point is, you're decoupling the people who own the data center from those who operate workloads on top of it.

It turns out that these users are actually quite demanding. They need very fast iteration. They don't want to wait 20 or 30 minutes for a VM to be provisioned. They want to ideate on models and things in production. They need a high degree of observability; they need to know what's going on. If you say "no access," then they need logs, they need metrics, they need all their monitoring, they need tracing. They need seamless application revision migration: if I'm on version one and I have version two, I need to be able to shift all the traffic, provide observability into all of that, and make sure they know what's going on.

And last but not least, all of this has to be self-service, and it has to be completely, absolutely automated. There should be no manual intervention for any of these steps. And the thing is, all of these are just the table stakes. We haven't even got to anything interesting. I feel like failing to have these tools dooms an organization to manual access and human-bound processes, and that represents a pretty large business risk for any given organization. A manual process is typically a crutch for automation avoidance. So I like to think that we should automate everything that we can; what we fail to automate becomes encoded in our organizational folklore, essentially communicated person to person by word of mouth. That's fundamentally not scalable.

So, the guts of the talk: experimentation. What hopefully most of you are here to hear about. More often than not, our gut feelings and intuition mislead us. As humans, we struggle to effectively assess the value of any given idea. What we think will be the outcome often turns out to be entirely wrong. So being able to objectively compare expectations and outcomes can prove to be an innovation accelerant within a given organization.

If teams feel that trying something new is pretty low cost, safe, and measurable, experimentation can really become part of an engineering organization's social fabric. It's okay to trial, it's okay to try, and it's okay to fail. That's really important.

Now, this is an operational conference with a lot of ops people, so most of us just like to call that testing in production. So from here on we're going to be talking about testing in production. As I was preparing for this talk I found this really great tweet from Charity Majors: "Easy failures should always be caught by tests. For anything ultra complicated, you pretty much have to test in production." And she is 100 percent right about that.

So let's talk about testing in production. Everybody's done it; nobody wants to own up to it. If you've ever done a rollback in production because something went differently than you expected, you've tested in production. An experiment, if you will. Many engineers find themselves with a false sense of security: when their test suites pass, they're often surprised by the sort of interesting things that go wrong in the production environment. Functional tests really are only the first layer of testing; they seldom cover the wide variety of things that can go wrong, particularly with systems at scale. So let's talk about some of these things.

One of my favorites is this concept of emergent behaviors. Our distributed systems are more complicated than ever. Microservices have caused an explosion in complexity; they make service contracts and the availability of peer systems harder than ever to reason about. And there are two laws whose intersection, I think, is what produces these emergent behaviors.

The first is Conway's Law: any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure. Anyone who works at a company with more than 100 people will absolutely be familiar with this: organizational siloing, simply not knowing what other parts of the organization are doing, as a product of your size.

The second is Hyrum's Law: with a sufficient number of users of an interface, all observable behaviors of your interface will be depended upon by somebody. Now, if we think about this in the context of microservices, what we typically see is: I'm in one group and there's a service in another group, and I don't really know them, but I know they provide this service. They may have five APIs; three of them might be highly optimized for the heavy read workflow the service was designed for, and then they might have two write workflows for the administration piece. From my perspective, I don't know them, I just see the interface they publish. So I go, "Okay, great, the two write APIs are the ones I need, because I need to ingest data into the system." So I call them, maybe I call them too aggressively, and maybe I take the system down, because I didn't talk to them; I didn't know. And so I feel like the intersection of these two laws is often where a lot of the complexity in microservice systems, and the emergent behaviors we see, comes from. The ways things get misused cover many, many different cases, but they're all social problems.

So when we talk about microservices, what nearly always comes up is containers. And one of the supposed benefits of containers is this advantage of local testing. I think that's a complete fable. If it works on my laptop, that's not really a useful endorsement of pretty much anything. If I consider a typical service with a database, and maybe I've got five of those services, each with their own database, it's not going to fit on my laptop; I only have eight gigabytes of RAM. Now, I could get a bigger laptop, but I would only buy myself a certain amount of headroom. If I consider that most microservice architectures end up having hundreds or thousands of services, local testing is a debunked myth.

Many production systems have low average volumes of traffic, but they have massive peaks. Twitter is notorious for its fail whale outages in the early days: those kinds of user-generated, unpredictable traffic peaks literally made them victims of their own success. It would surely only take one cute cat meme to bring down Twitter. Joking aside, this is the world we live in. These peaks and troughs are something that we have to adapt to. There are more people with more devices online than ever, and that's not decreasing anytime soon. So we need to develop strategies to effectively handle and scale for these kinds of problems.

Last but not least, and this is one of my favorites by the way, is human error. We can automate, we can do many things, but we're still human; we make so many mistakes. There was a really great study out of Facebook which showed that their systems were at their most stable during the week of Christmas and the week their yearly reviews were due for submission. The stability of the system was inversely proportional to the number of developer changesets being applied: the more they worked on it, the worse it got. It's kind of interesting that that happens, and I think most organizations can relate to that sentiment in one way or another. Businesses highly prize high-velocity developer parallelism; it's coveted by most. But in many cases this comes at a cost to overall system stability. It's okay to move fast and break things, but move faster and fix things.

Now, with all this being said, the concept of experimental testing in production is not a new one. We can look towards a lot of prior art, specifically the papers and publications from the likes of Google, Amazon, and Microsoft, among many others. They form the foundation of much of our industry knowledge about experimentation systems. Microsoft's Bing experimentation system is one of the largest in the world: it has over 200 concurrently running experiments and exposes about 100 million users to billions of Bing variants every single day. I think that's a really staggering statistic. Now, these different companies all operate in different market sectors. If you're an ads company you care about transaction rates, or impression rates and targeting. If you're an e-commerce store you care about your transaction rates: "Why aren't people buying things?" If you're a telecoms company you care about "Why are people calling my call center? Why are people returning my product?" All of these kinds of things make a huge difference.

And so regardless of the paper or prior art that you're looking at, a fundamental tenet of experimentation is this idea of segmentation. That is, from your total population, indicated here by the outer circle, you take a subset group and use them for testing. Within that test group, half the participants will be the experimental control, where no changes are applied; they get the vanilla experience, whatever that is. The other half are your treatment group, which receives a modified experience in some way.

So when we talk about segmentation, people often think about making groups like "males aged 20-30," or something along those lines, which is nearly never what you want. Experimentation typically works best with random assignment. Even the slightest targeting can subtly skew your results in one way or another. For example, let's say you conduct an experiment and you fill your segment buckets with participants between 10 a.m. and noon, and maybe it's for a consumer device like a TV or something. Whilst your participants are indeed random, you've introduced an inherent bias: it's highly probable that most people are at their place of work during those hours, so you've accidentally selected participants with a very similar demographic profile.
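To make the idea concrete, here is a minimal sketch, in Go with made-up names, of hash-based assignment: bucketing on a stable user identifier keeps the split deterministic per user while remaining random with respect to time of day or demographics.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// assignBucket deterministically maps a user ID into one of n buckets.
// Hashing the ID (rather than, say, arrival time) keeps assignment random
// with respect to demographics while remaining stable across requests.
func assignBucket(userID string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32() % n
}

func main() {
	for _, id := range []string{"alice", "bob", "carol"} {
		// Two buckets: 0 = control (vanilla experience), 1 = treatment.
		fmt.Printf("%s -> bucket %d\n", id, assignBucket(id, 2))
	}
}
```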

And so, with this framing, it's really important to appreciate the nuance that comes with segmentation and targeting for experiments. If you introduce any kind of bias, you have to be absolutely sure to eliminate it, or at least account for it when you're interpreting those results.

So we know that we need a segment of the population. There has to be some threshold, defined ahead of time by an operator, specifying what the experiment needs: for example, I need a hundred thousand users, or I need a million users, something like that. The first thing we need to figure out is whether an inbound request is from a user that's already part of an experiment or not, and whether it should become part of an experiment or segment. Maybe we exclude a certain category of customers because they're very high value, and then we account for that bias. Whatever we do, this is typically done by having an edge system intercept the call, reach out to a segment assigner, and make that decision. Naturally there are optimizations that can be done here to avoid a call on each and every request from the user, but for the simplicity of this talk we'll just assume that we call every time.

Now, the segment assigner has a very low round-trip latency budget in which to figure out whether the experiments that are currently ongoing require additional participants or not. Any outstanding experiment buckets can be filled in a random order until they meet their specified participant threshold. So if I've got three experiments that I need to fill, I'm just going to fill them randomly, rather than trying to do anything fancy with the bucketing. Again, we don't want to introduce any sort of bias.
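A minimal sketch of what such a segment assigner might look like, assuming hypothetical Experiment and Assigner types; a real service would also need persistence and the tight latency budget mentioned above, which are omitted here.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Experiment tracks how many participants an experiment still needs.
// These types are illustrative, not from any particular system.
type Experiment struct {
	ID        string
	Threshold int // participants requested by the operator
	Enrolled  int
}

// Assigner picks an experiment for a new, unassigned request.
type Assigner struct {
	Experiments []*Experiment
}

// Assign returns an experiment that still needs participants, chosen at
// random so that no experiment systematically receives earlier traffic.
// It returns nil when every experiment has met its threshold.
func (a *Assigner) Assign() *Experiment {
	open := make([]*Experiment, 0, len(a.Experiments))
	for _, e := range a.Experiments {
		if e.Enrolled < e.Threshold {
			open = append(open, e)
		}
	}
	if len(open) == 0 {
		return nil
	}
	e := open[rand.Intn(len(open))]
	e.Enrolled++
	return e
}

func main() {
	a := &Assigner{Experiments: []*Experiment{
		{ID: "exp-a", Threshold: 2},
		{ID: "exp-b", Threshold: 1},
	}}
	for i := 0; i < 4; i++ {
		if e := a.Assign(); e != nil {
			fmt.Println("assigned to", e.ID)
		} else {
			fmt.Println("all experiments full; no assignment")
		}
	}
}
```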

So inbound requests flow to the edge, they optionally get assigned to a segment, and then that experiment segment handle is propagated throughout the service topology. This is probably the hardest part of experimentation for most organizations, as it requires buy-in from pretty much all the service owners in a given call chain. If you're already supporting something like OpenTracing, Jaeger, or Zipkin, it's highly likely that you have some kind of context that you propagate around, so bolting on that experimental context is probably pretty low impact. If, however, you don't have distributed tracing, you can use the addition of tracing as a nice incentive, a carrot if you will, to get people to change their code and adopt your experimentation system.
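As an illustration, here is a small sketch of propagating a segment handle as an HTTP header, in the same spirit as tracing-context propagation; the header name X-Experiment-Segment is an assumption for the example, not a standard.

```go
package main

import (
	"fmt"
	"net/http"
)

// The header name is an assumption for illustration; any stable key works
// as long as every service in the call chain forwards it unchanged.
const segmentHeader = "X-Experiment-Segment"

// propagateSegment copies the experiment segment handle from an inbound
// request onto an outbound request, much like tracing middleware copies
// trace and span IDs.
func propagateSegment(in *http.Request, out *http.Request) {
	if seg := in.Header.Get(segmentHeader); seg != "" {
		out.Header.Set(segmentHeader, seg)
	}
}

func main() {
	in, _ := http.NewRequest("GET", "http://edge.example/api", nil)
	in.Header.Set(segmentHeader, "exp-a:treatment")

	out, _ := http.NewRequest("GET", "http://downstream.example/peers", nil)
	propagateSegment(in, out)
	fmt.Println("forwarded segment:", out.Header.Get(segmentHeader))
}
```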

Last but not least, every single system that wishes to participate in experiments needs to publish its telemetry, publish its outputs, tagged with the segment handle so that analysis can differentiate those various data points. Now, the diagram shows a single telemetry sink, but it could easily be Kafka, a monitoring solution, or even just plain old logs; it doesn't really matter. The key thing is that you've got to record that behavior and export it to a point of analysis. It may be that you have multiple sources contributing to that overall analysis, but the overall design is the same. Make sure you export and tag that data so you know exactly where it came from: was it the control, or was it the treatment?
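A tiny sketch of what segment-tagged telemetry might look like; the Metric type and the stdout sink are stand-ins for whatever metrics or logging pipeline you actually use.

```go
package main

import (
	"fmt"
	"time"
)

// Metric is a single telemetry data point tagged with the segment handle,
// so analysis can later separate control from treatment. The sink here is
// just stdout; it could equally be Kafka, a metrics system, or plain logs.
type Metric struct {
	Name    string
	Value   float64
	Segment string // e.g. "exp-a:control" or "exp-a:treatment"
	At      time.Time
}

func emit(m Metric) {
	fmt.Printf("%s name=%s value=%.2f segment=%s\n",
		m.At.Format(time.RFC3339), m.Name, m.Value, m.Segment)
}

func main() {
	emit(Metric{Name: "checkout.latency_ms", Value: 42.0,
		Segment: "exp-a:control", At: time.Now()})
	emit(Metric{Name: "checkout.latency_ms", Value: 37.5,
		Segment: "exp-a:treatment", At: time.Now()})
}
```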

Now that we know how our experimentation workflow could work, we need to address some of the practical matters that actually get participants into our experimental systems. Typically, and I'm sure many people in this room will be familiar with canary deployments: I've got my 1.1.1 version, I deploy 1.1.2, and then I continue to replace machines until I've replaced them all. I actually consider this pretty harmful; I think the unit of work is simply too large. A modern Linux machine can handle thousands or even millions of connections, so I don't want the scope or speed of my experiment to be governed by the operational choices you or someone else made. It lacks granularity.

Whilst this has worked for quite some time, with varying degrees of mileage for different organizations, I don't think it's the ideal solution. Granular traffic splitting is, in my opinion, a much better alternative to canary machines. It allows operators to push and pull traffic around their topology at a rate that makes sense for their application: ramping up traffic on the new application revision as needed, and pushing it back to the older version if the patch is not working out as intended. It's important to note that we're assuming there's no multivariate testing here; for anyone who's interested, I can talk a little about that later.
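For illustration, a sketch of weighted revision selection, which is the core of granular traffic splitting; the weights here are hard-coded, whereas a real control plane would adjust them over time.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Backend is one revision of a service plus the fraction of traffic it
// should receive. Weights are illustrative; a real control plane would
// push them to the proxies dynamically.
type Backend struct {
	Revision string
	Weight   float64 // 0.0 .. 1.0, weights should sum to 1.0
}

// pick selects a revision in proportion to its weight, which is how an
// operator can ramp traffic onto 1.1.2 a few percent at a time instead of
// replacing whole canary machines.
func pick(backends []Backend) string {
	r := rand.Float64()
	acc := 0.0
	for _, b := range backends {
		acc += b.Weight
		if r < acc {
			return b.Revision
		}
	}
	return backends[len(backends)-1].Revision
}

func main() {
	split := []Backend{
		{Revision: "1.1.1", Weight: 0.95},
		{Revision: "1.1.2", Weight: 0.05},
	}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pick(split)]++
	}
	fmt.Println(counts)
}
```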

So the benefits of this approach are granularity and fine-grained control over a given experimental segment. The downside is that you need spare operational resource capacity, so that you can run the old version and the new version, scaled, at the same time. In the vast majority of cases this doesn't prove to be too much of an issue, as most clusters are over-allocated.

Now, the next mode of experimentation is what I would refer to as ad hoc experimentation. Typically you might have engineering staff who want to deploy something and poke around, or otherwise run some integration tests themselves, to ensure that their new code works as intended when deployed. This is what per-request routing gets you. A request is sent from the user with some context, for example a header, intercepted at the edge, and forwarded to the appropriate internal service, no matter how deep it is in the call chain. This kind of flexibility is extremely useful for dev testing or QA-style testing. It's naturally not a scalable solution for production experimentation, but it's nevertheless a very useful capability: being able to silently deploy something in production such that only you can access it is very useful.
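A minimal sketch of header-based per-request routing at the edge; the header name and upstream addresses are invented for the example.

```go
package main

import (
	"fmt"
	"net/http"
)

// The header name and revision values are assumptions for illustration.
const routeHeader = "X-Route-Revision"

// chooseUpstream sends a request carrying the routing header to the
// silently deployed revision; everything else goes to the stable default.
func chooseUpstream(r *http.Request) string {
	if rev := r.Header.Get(routeHeader); rev != "" {
		return "service-" + rev + ".internal:8080"
	}
	return "service-stable.internal:8080"
}

func main() {
	normal, _ := http.NewRequest("GET", "/checkout", nil)
	tester, _ := http.NewRequest("GET", "/checkout", nil)
	tester.Header.Set(routeHeader, "1.1.2-rc1")

	fmt.Println("normal user ->", chooseUpstream(normal))
	fmt.Println("engineer    ->", chooseUpstream(tester))
}
```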

So the astute audience member may have noticed that the previous two strategies only work for systems that are internal to my service topology, where the edge component handles request interception and delegates to the appropriate downstream. This presents a specific problem for the edge systems themselves. Assuming we can't just go to the teams that run the edge systems and say, "Hey, no experimentation for you folks," we need to think about what we can do about it. One strategy would be to use good old DNS and flip-flop clients between endpoints. This can work in some cases if you fully control the client, but by and large it's suboptimal. If you're a phone app or a POS, or somewhere you don't control the client, you don't control how or when updates are applied, so it's not that great. You could mitigate that with device polling and an SRV record, or something similar, but those strategies start to fall down in low-power environments like IoT, where background processes are severely limited due to power or connectivity constraints.
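If you do go the DNS route, SRV polling might look something like this sketch using Go's standard resolver; the service and domain names are placeholders.

```go
package main

import (
	"fmt"
	"net"
)

// lookupEdges resolves an SRV record to find the current set of edge
// endpoints, so a client can periodically re-poll rather than hard-coding
// a single hostname. The service and domain names here are made up.
func lookupEdges() ([]string, error) {
	_, srvs, err := net.LookupSRV("https", "tcp", "edge.example.com")
	if err != nil {
		return nil, err
	}
	endpoints := make([]string, 0, len(srvs))
	for _, s := range srvs {
		endpoints = append(endpoints, fmt.Sprintf("%s:%d", s.Target, s.Port))
	}
	return endpoints, nil
}

func main() {
	eps, err := lookupEdges()
	if err != nil {
		fmt.Println("lookup failed (expected for this placeholder domain):", err)
		return
	}
	fmt.Println("edge endpoints:", eps)
}
```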

So instead of using DNS at the edge, I think a better, more robust solution is this idea of edge trees, or proxy trees. This means that you have a stable, consistent DNS entry point for clients to call, so no client update is required, and you can split and shift traffic from that common entry point. Whilst this diagram only shows the tree to one level, there's nothing stopping you from having as many levels in the tree as you need, with various points of splitting and delegation. The important thing to note here is that these two edge systems can make use of common internal systems. You're not building an entirely new world; you're simply allowing experimentation at the edge of the system.

Whilst this can be quite effective, you need to be aware that you're of course paying a small latency cost here in terms of the network hops. So you need to be careful to mitigate that latency increase by colocating those ingress points as close to your edge systems as you can.

Last but not least is an experimentation strategy that's probably familiar to those who've worked on monolithic deployments: internal code path changes that make use of inbound request context to figure out which functions to call. Broadly speaking, I think the other strategies we just discussed are superior, but there are a few cases where this is still the more viable option, specifically when conducting data-bound experiments, such as when you need to try out a different storage model, or perhaps you're even trying a different database. Whatever it happens to be, the application has to have some kind of knowledge that you're trying to conduct that kind of experiment. With that being said, I do think it's rare, but if it's used excessively it can get totally out of hand. From a management perspective, engineers are not sure which experimental paths in the code they can safely remove, so that kind of results in never removing any, and creating an immense amount of technical debt.

So those are a few different strategies for experimentation on the server side. There are indeed several other strategies that we haven't discussed, but I wanted to focus on these because I think there's never been a better time to embrace building these kinds of experimentation systems on the server side. We have more building blocks for routing, scheduling, and workload mobility than ever before. In the next few slides we'll talk about some of the pieces that are available and how you can use them to build your own experimentation systems.

The first, obviously, is the scheduling substrate. This is probably one of the most well-known elements, so I won't spend too much time on it. The space has absolutely exploded in the last few years: Nomad, Kubernetes, Mesos, to name but a few of the available options, and each system gives you pros and cons. We've done a couple of different things at work, and I think if you cannot containerize your applications right now, then perhaps Nomad or Mesos makes sense. Likewise, if you're not familiar with resource managers, or you don't want to write your own scheduler and you want more of a PaaS-like experience, maybe Kubernetes is a better fit. The point is, you need to objectively look at your own requirements, look at your own organization, and make an informed choice. I would definitely encourage you not to follow the hype; all of these products have various trade-offs. I've run Nomad and Mesos at scale in production for a number of years and it's very interesting to see the trade-offs. If anyone's interested, I can talk to you about that afterwards.

The data plane. The data plane is the part of the infrastructure that's responsible for routing requests to and from different parts of our topology. There are various proxy technologies available today, some old, some new, but the real difference with the modern variants is their ability to integrate with a so-called control plane. Envoy is leading the herd, in my opinion. It has massive investment from Google, IBM, and Lyft, and just a couple of weeks ago Nginx said that they had started to support some of the same control plane APIs, meaning we could end up seeing a standard for these kinds of control plane APIs. Hopefully standardized for experimentation; that would be awesome.

Other options in this space are Linkerd and Nginx Plus; there are various different things that you could do, but again, look at your requirements, see what you need, and make objective choices. Now, if you're going to run this, how does that actually look? In terms of operating the data plane there are potentially innumerable configurations, so I'll just talk about three, because they're the three that many people jump to, and some of their trade-offs.

So in this case we've got some machine. Nomad is my scheduling substrate, Consul is my service catalog. I've got some container deployed on some node, some allocations, and I've literally got Envoy embedded inside my container. This works for most people. The problem is that you're then unable to reason about resource consumption for Envoy versus your application, and before you know it you need your logging and several other things in there too. So it's kind of a slippery slope: easy to get started, but it falls down.

A more Kubernetes-style way of doing it, a pod for anybody who's familiar, is to run this thing as a sidecar. In Nomad parlance this is a task group with multiple tasks inside the task group. What that means is that when it comes to scheduling, those two containers are discretely scheduled on the same machine and, let's say it's running in bridge mode, they both know about each other's IP and port combinations on that bridge.

So we can do this, and it has some benefits. If we decided, for example, that we wanted to do a mass update of the cipher suites that we supported, we could simply roll through the cluster operationally, without anybody noticing, and update all the Envoy sidecars to update their SSL configuration. That's a very nice pro. The downside is obviously that we are running more containers, we have slightly more complexity, and there are more moving parts. So again, no perfect solutions, just different trade-offs.

The last is host-based Envoy, or a host-based data plane. I'm saying Envoy in these slides because that's the choice I've made, but please don't assume that's the only way; I'm just highlighting that this is another way you could run the data plane. Host-based Envoy is kind of attractive: I have only one proxy on every host, and I have as many parts in my routing infrastructure as I do machines, which is kind of nice. The downside is that in the case where I do have a problem with the data plane, I'm essentially damaging QoS for every single application running on that machine. In many deployments of these kinds of things you end up with a high degree of tenancy, so you might have 80, 90, 100, 150 applications on a given host. So it becomes unwieldy to manage over time, in my opinion.

So, we've talked about the data plane; let's talk about the control plane, which is the more interesting part, to be honest with you. There are various options available right now, as open source or indeed commercial projects, for the control plane: Istio, Linkerd, and Nelson. Shameless self plug: Nelson was open sourced by my group at Verizon, so it's on GitHub, go and check it out. If you're running the HashiCorp stack, it's really nice.

So this is essentially how the control plane works. The control plane sits there, and any orchestrator that you have, whether it's Nelson or Istio or whatever it happens to be, is basically saying, "I want this traffic to go here, I want this, I want this," and setting all the constraints. Like: I need the foo application, version 1.1.1; where is that thing? Translate that logical name into IP and port combinations. They've all got different IPs, they've all got different ports, and the control plane serves up that information. In many cases you can also do interesting things like automatic zone affinity: if an inbound caller says, "Hey, I'm IP 10.10.10.10," we can say, "Okay, based on the IP we know that's in AZ us-east-1b," and then you can give a response that prioritizes endpoints in that zone as the things it should try and call first.
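A small sketch of the zone-affinity idea: ordering the resolved endpoints so same-zone instances are tried first. The Endpoint type, addresses, and zones are illustrative, not from any particular control plane.

```go
package main

import (
	"fmt"
	"sort"
)

// Endpoint is one concrete instance of a logical service name as the
// control plane would serve it back to the data plane.
type Endpoint struct {
	Addr string
	Zone string // e.g. "us-east-1b"
}

// resolve orders the endpoints for a logical name so that instances in the
// caller's availability zone are tried first, which is the zone-affinity
// optimization described above.
func resolve(callerZone string, eps []Endpoint) []Endpoint {
	out := append([]Endpoint(nil), eps...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Zone == callerZone && out[j].Zone != callerZone
	})
	return out
}

func main() {
	eps := []Endpoint{
		{Addr: "10.0.1.5:8443", Zone: "us-east-1a"},
		{Addr: "10.0.2.9:8443", Zone: "us-east-1b"},
		{Addr: "10.0.2.4:8443", Zone: "us-east-1b"},
	}
	for _, e := range resolve("us-east-1b", eps) {
		fmt.Println(e.Addr, e.Zone)
	}
}
```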

And so there are all these interesting optimizations that you can do to lower latency and increase your overall availability. The other nice thing is that, with the control plane and data plane working in conjunction, you can get automatic encryption. There's this very nice thing that happens, which is that my application can make a dumb HTTP request, and then in the data plane I can say, "Okay, take that request, encrypt it, wrap it in mutual TLS," dynamically and transparently to the application. And then I can have Vault provision dynamic credentials for every single container in my entire platform, where every single container has unique credentials and unique certificates, and all the traffic is encrypted on the wire. For those who were in the keynote this morning and saw namespaces: even if I'm in the same namespace, I still get privacy for my application, and I can get that in a really automated way.
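As a rough sketch of the Vault piece, issuing a unique certificate per container from Vault's PKI secrets engine might look like the following; the mount path, role name, address, and token are assumptions for the example, and in practice the data plane or an init process would do this on the container's behalf.

```go
package main

import (
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

// issueCert requests a short-lived certificate from Vault's PKI secrets
// engine, assuming a "pki" mount and a role named "service-cert" already
// exist; both names are assumptions for this example.
func issueCert(addr, token, commonName string) error {
	cfg := vault.DefaultConfig()
	cfg.Address = addr
	client, err := vault.NewClient(cfg)
	if err != nil {
		return err
	}
	client.SetToken(token)

	secret, err := client.Logical().Write("pki/issue/service-cert",
		map[string]interface{}{"common_name": commonName})
	if err != nil {
		return err
	}
	if secret == nil || secret.Data == nil {
		return fmt.Errorf("no data returned for %s", commonName)
	}
	// The response contains a unique certificate and private key for this
	// one container; the data plane can use them for mutual TLS.
	fmt.Println("issued certificate for", commonName)
	_ = secret.Data["certificate"]
	_ = secret.Data["private_key"]
	return nil
}

func main() {
	// Assumes a local dev Vault server; these values are placeholders.
	if err := issueCert("http://127.0.0.1:8200", "dev-token", "checkout.service.internal"); err != nil {
		log.Fatal(err)
	}
}
```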

So we do this nice thing from the control plane in Nelson, which is that when users deploy a new application, say they had version 1.1.1 and they deploy version 1.1.2, we give them a choice: how would you like me to shift your traffic? In some cases they can choose a power curve, or they can choose a linear curve, which is the default, and you can do these kinds of interesting things, because depending on your applications you might have different requirements. The point is, the control plane is the main means by which you can control traffic and workload mobility in your system.
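To illustrate the difference between the curves, here is a small sketch of how a traffic-shift weight could be computed over the course of a rollout; the cubic exponent is an arbitrary choice for the power curve, not Nelson's actual policy.

```go
package main

import (
	"fmt"
	"math"
)

// weight returns the fraction of traffic the new revision should receive
// at a point 0.0..1.0 through the shift. "linear" ramps evenly; "power"
// stays low early and ramps sharply at the end.
func weight(curve string, progress float64) float64 {
	switch curve {
	case "power":
		return math.Pow(progress, 3) // illustrative exponent
	default: // linear
		return progress
	}
}

func main() {
	for _, p := range []float64{0.25, 0.5, 0.75, 1.0} {
		fmt.Printf("progress %.2f  linear=%.2f  power=%.2f\n",
			p, weight("linear", p), weight("power", p))
	}
}
```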

It's not all roses, you know. This stuff is not immediately easy to do. There's nothing I can give you that will just plug and play into all your legacy applications, sorry. There's a really nice law when it comes to analyzing how these things are working: Twyman's Law. Any piece of data or evidence that looks interesting or unusual is probably wrong. We see this quite often, so it's really important to objectively measure, account for your own biases, and make sure that whatever you're doing, however you're experimenting, however you're comparing these things, you know what the inputs and outputs are, you understand what the bias is, and you take an objective look at the analysis.

On top of this, your data is disparate: whether you've got some stuff in Splunk, some stuff in Prometheus, some stuff in StatsD, some stuff in whatever, you need some way to look at all that data holistically. For many organizations, particularly larger organizations, this becomes a problem over time, and it's a whole engineering effort on its own to coalesce this data and answer these kinds of experimentation questions. So I understand it's not easy for many organizations.

Now, when it comes to observability, in order to feed that analysis you want to make sure that every single piece of data you're ingesting is tagged with segment identifiers. And yes, that does mean modifying all the applications, as we were discussing. I bring it up again because I really do feel like this is the biggest challenge. There are many things we can do as infrastructure engineers to incentivize teams and help them migrate, but it truly remains one of the hardest things. Despite these difficulties, I think building these kinds of systems has a major impact on an organization. It can really change the way you work as an engineering organization.

The bottom line is that experimentation infrastructure makes your workloads mobile, and that's super useful for a variety of reasons, both via the scheduling substrate and via fine-grained traffic control systems. This kind of tooling really empowers your organization across a multitude of operational vectors. It doesn't matter whether you're providing fine-grained security or enabling an overarching strategy like hybrid cloud: being able to push and pull your traffic patterns within your network topology as needed, perhaps for an experiment, or perhaps because you're fixing some late-night outage, having the flexibility to move your workloads and your traffic in a way that suits your business, is really, really powerful.

Last, I'll leave you with a final thought. Whatever you decide to do, you need to make sure that it's fully automated. Whether you're aware of it or not, someone in your organization is probably doing a lot of these things by hand, and that's a really difficult situation. So we want to empower them, empower the organization, to make mistakes, learn from those mistakes, and objectively compare: how was I then, how am I now?

I've got a few minutes, so I'll just share a couple of stories to give people some context. It's actually quite interesting: once you start to build these experimentation systems, you can have all sorts of interesting automation. You can say, "Okay, I had the old one and then I had the new one; let's automatically look at all the data, objectively, and build a confidence score." The system can say things like, "I'm 86 percent confident this is better than the last one," and then you can do automatic promotions. Then you don't need people going into whatever system it is to do a promotion; the concept of promotion can kind of go away, and you can instead fall back on fully automated testing. Fully automated systems are lights-out to operate.
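A toy sketch of that kind of automatic promotion decision: comparing a success metric between control and treatment and promoting only above a confidence threshold. The two-proportion z-test and the 0.86 threshold are illustrative choices, not how any particular system does it.

```go
package main

import (
	"fmt"
	"math"
)

// promote decides whether the new revision should be promoted automatically
// by comparing a success metric between control and treatment with a simple
// two-proportion z-test. The 0.86 threshold mirrors the "86 percent
// confident" example above and is an arbitrary choice.
func promote(ctrlSuccess, ctrlTotal, treatSuccess, treatTotal float64) (float64, bool) {
	p1 := ctrlSuccess / ctrlTotal
	p2 := treatSuccess / treatTotal
	pooled := (ctrlSuccess + treatSuccess) / (ctrlTotal + treatTotal)
	se := math.Sqrt(pooled * (1 - pooled) * (1/ctrlTotal + 1/treatTotal))
	z := (p2 - p1) / se
	// Probability that the treatment is genuinely better, under the
	// normal approximation.
	confidence := 0.5 * (1 + math.Erf(z/math.Sqrt2))
	return confidence, confidence >= 0.86
}

func main() {
	conf, ok := promote(480, 10000, 540, 10000)
	fmt.Printf("confidence=%.2f promote=%v\n", conf, ok)
}
```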

More often than not, our systems will heal themselves in the middle of the night when they go wrong, and they do go wrong, whether by rescheduling the work or by our system, Nelson, shifting traffic away from the problem. This is really, really useful. With that, I understand we're not going to have Q&A. I know we just covered a lot of ground, and this is all perhaps a little vague, perhaps useful for some, but regardless, thanks very much for having me, and if anybody's got any questions I would love to talk to you on the side afterwards, or find me in the corridor.

Thank you very much everybody.
