How did eBay Classifieds move from locally managed infrastructure to a private cloud extremely quickly? This talk explains how eBay selected its technology stack and the decisions that went into building a stable, highly-scalable platform.
During the past year the Site Operations team at eBay Classifieds Group has put massive effort into moving from locally managed infrastructure to a private cloud. With the announcement of eBay Classifieds Group's new global strategy and the technology center being placed in Berlin, the engineering teams were given the challenge of developing a new and innovative plan for how to build the new product.
With many challenges related to the project, one of the biggest was building a platform that was production ready in an extremely short time frame. This talk dives into how we selected our technology stack and the decisions that went into building a stable, highly scalable platform.
Welcome, everyone. Nice to have you here. Glad I can be here. We’re a bit short on time, so let’s just hop straight into it.
A fully containerized platform, that’s basically the topic of the talk, one half of it. The other half is, it’s based on infrastructure as code. More buzzwords, more fun. Let’s just talk really quick about how this is going to go.
We’re going to do a short bit of introduction. One part is this: I’m going to tell you a few things about me, who I am, what I do. And then we’re kicking off the demo. It’s a 2-part demo. So we’re kicking it off and then going to check back on it after the talk. While it’s running in the background I’m going to tell you what we built there, what the demo is doing, and then at the very end we’re not going to have Q&A because someone decided to talk longer than expected. I’m probably going to be hanging out somewhere in the back after the talk; just come to me, or hit me up on Twitter or whatever. I’m open to questions, feedback, pretty much anything.
What does that even mean? I cannot tell you. That’s what the talk is about. You’ll find out afterwards. I hope you know at the end of the talk.
Some quick words about me. My name is Rick. I am a site reliability engineer at the eBay Classifieds Group, Global Motors vertical. It’s an extremely long job title; it just means I’m an SRE on a central team. For the questions and feedback that I mentioned, that’s my Twitter handle. Just hit me up if there’s anything you want to talk about: feedback, questions, as I said, anything.
So, demo time, as I promised. As I said, we need to start off the demo right now because it takes a couple of minutes. What’s going to happen is I’m just going to kick it off, and then later in the talk we point out what it’s doing. It’s basically just this one; mark that for later. That should be pretty much it. So we wrapped a makefile around the Terraform commands. It’s going to run in the background, and we’re going to check back on it later on.
Due to a bit of time constraints, I’m going to rush a bit through things. So if anything is unclear, please just call me out.
Let’s see what we did there. At the end of last year, the company came to our organization, as in the SRE org inside the company, and said, “Look, we want to build a motors vertical, a global one. Could you just build a platform for it?” We were like, “Yeah, sure, that’s going to be amazing. Is it like greenfield work?”
“Yeah, You can do whatever you want, but if you could do it fast, that’d be amazing.”
On one hand, we were super happy, because who doesn’t like greenfield work? On the other hand, the time pressure forced us to adopt some things from the platforms that we had already worked on, but we also wanted to do things a bit differently. We wanted to follow some guiding principles and best practices, so we had some recommendations and some requirements from our side on how this new platform was supposed to look. We wanted everything to be resilient, wanted it to be stable, easy to maintain, easy to scale. Basically, all of the things you really want from an operations perspective that make your life easy. I think that’s pretty much what everyone wants for their platform.
We had some really interesting ideas for what we were going to build, and then someone external said, “Well, did you ask the product team what they want?” And we hadn’t. So we had to do that afterwards, and we went to the product development team and asked them what exactly their expectations of this new platform were. Because we could have built just a solution that we like, but at the end of the day a platform is the whole thing, not just our plumbing.
There was one shared point: they also wanted stability. They wanted a shared pre-production environment. But then they also wanted to have the “works on my machine” kind of thing, and they wanted real-life infrastructure and real-life components for testing.
The cool thing is that we’re running on an OpenStack private cloud. And we’re basically giving a little tenant to everyone who works in the company who has logged in once. We had the really cool idea, “OK, let’s just build the exact same platform in production, non-production, and the dev tenant, for each and everyone in the dev tenant.” So it would just be a miniature version of production.
That’s where we come to the real-life infrastructure and real-life components. Now we’ve also found out what we’re doing in the background with the demo: We’re building the miniature production version in my personal dev tenant, which then hopefully, at the end of the presentation, will allow us to deploy the platform, or at least some services from the platform—not all of them. So how do we do that?
As I said, we’re running an OpenStack private cloud. How do we build from there? That’s not really managed by us; we’re free to use it and just build on top of it. We thought, “OK, it’s not really feasible if we just go to Horizon—that’s what the UI of OpenStack is called—for every developer and click their instances together for them.” That’s not really going to work: not for developers, not for non-production, and also not for production. So we needed infrastructure as code.
All things need to be reproducible, easy to understand, and also easy to replicate. What do I mean by that? Easy to understand: everyone who has used Terraform—that’s what we use for it; that’s why I’m here at HashiConf—knows that it is pretty easy to understand in most cases. The reproducible and easy-to-replicate part basically means: if we expand to a new region, we can replicate the whole infrastructure extremely easily. If we have a crash, we can reproduce our whole infrastructure really easily. Because it’s infrastructure as code, we just run Terraform, and it will spawn all the things for us: instances, networks, subnets. It makes our life extremely uncomplicated.
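To make that concrete, here’s a rough Terraform sketch of what spawning those things against OpenStack can look like. This is a minimal illustration, not our actual modules: the names, image, flavor, and CIDR are all made up.

```hcl
# Minimal sketch using the OpenStack Terraform provider.
# All names/images/flavors here are hypothetical.
provider "openstack" {
  # Credentials typically come from OS_* environment variables.
}

resource "openstack_networking_network_v2" "platform" {
  name           = "platform-net"
  admin_state_up = true
}

resource "openstack_networking_subnet_v2" "platform" {
  name       = "platform-subnet"
  network_id = openstack_networking_network_v2.platform.id
  cidr       = "10.0.10.0/24"
  ip_version = 4
}

resource "openstack_compute_instance_v2" "nomad_worker" {
  count       = 3
  name        = "nomad-worker-${count.index}"
  image_name  = "ubuntu-18.04"
  flavor_name = "m1.medium"

  network {
    uuid = openstack_networking_network_v2.platform.id
  }
}
```

The same code replicated into another region or tenant gives you the same miniature platform, which is exactly the reproducibility point above.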
So that’s one side of it, but then you only have virtual machines in your private cloud. That was not really doing the job well enough, because it would have meant deploying whatever the PD teams produced to bare VM instances. That’s not really 2018. We didn’t want to do that. We considered it a lot easier to put things in containers.
So that’s the fully containerized part.
Why am I saying it’s a lot easier? Less config management. Imagine you have a VM and you want to run Java on it. You need to configure a lot more. And now imagine you have a microservice architecture and you need to configure every host separately for every microservice that you run. We really didn’t want to do that, so we just stick the artifacts in the container and run it—makes life a lot easier. All of the hosts are almost the same. Easy to maintain.
The next: easy deployment. Basically, we didn’t have to take care of, “How do we get the artifacts to the virtual machine so we can make sure they’re actually running?” We just, in our build pipeline, built up our images and pushed them to our internal registry, and all you have to do is just run a Docker container on an instance.
But now, if you think about it, is that feasible at scale? If you run more than 1 or 2 instances, probably not so much. Imagine you had like a gazillion instances, and on each and every one of them you had to run `docker run` or whatever starts your container. That’s not feasible. So we thought, “OK, we need a platform-as-a-service solution. Which container orchestration engine do we take?”
So, we started out with that, and then we thought, “Why not run all the things in there? Why are we even trying to run anything outside of it? Let’s just run all the things in there, because we’re extremely lazy.” And basically, our platform-as-a-service solution takes care of the containers for us, and it makes deployments a lot easier. You just talk to an API instead of talking to each and every respective host separately.
Scaling: It’s the same thing. You don’t need to manually start more containers. You just talk to the API, and then off you go.
Also, higher fault tolerance. I don’t like being woken up in the middle of the night just because one Docker container somewhere died and I need to manually restart it. That’s not what I want. With a container orchestration engine: container dies, no one needs to respond, no one needs to be woken up. Obviously, that’s only half the truth; if it constantly keeps dying, you do want to manually check on it, because there might be an issue.
So all the things really came together. That is what we ended up with. So what you see, the only thing that looks overcomplicated, that’s just the various OpenStack components. As I said, we don’t have to worry about that. All our infrastructure is as code, so Terraform is taking care of talking to all those various components, to create the virtual machines, create the instances, create the subnets, create the networks, the security groups, all of that. All we have to take care of is the thing that is in the cloud logo thing, and that’s pretty much all that we end up with.
As you can see, there is a Salt master. We do need some minimal configuration management, because the other 2 components there, the PanteraS master and the Nomad master, somehow need to be configured. The cool thing is, with that setup, we don’t need to treat that Salt master as a snowflake, because it’s disposable. It does its job once, when the other instances are created, and then never again. So if that Salt master dies or gets corrupted or whatever, just trash it, spawn it again, and let it pull the code. That’s it. End of story.
And then you see that we have PanteraS as a cluster and Nomad as a cluster. PanteraS is basically an in-house-built, open-source solution that glues together Marathon, Mesos, Consul, Registrator, and Fabio as a load balancer. And that’s where we deploy all our stateless applications. So pretty much everything that the PD teams develop goes there.
And then we have the Nomad cluster, Nomad masters and workers. That is basically where we deploy everything that is persistent or in the persistence layer, and also streaming kinds of things: databases, Kafka, whatever. There are 2 cool things and 1 weirdish reason behind that. Why do we have 2 different stacks? That’s for historic reasons. We had very good experiences with our own open-source solution for stateless applications, and we were not happy with how it performed with stateful applications, so we thought, “Let’s just try Nomad.” And it actually does the job really well.
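For illustration, a persistence-layer service on Nomad could look roughly like this job spec, in the circa-2018 HCL syntax. The job name, image, and resource numbers are invented, not our actual configuration:

```hcl
# Hypothetical stateful service job for the Nomad cluster.
job "mongodb" {
  datacenters = ["dc1"]
  type        = "service"

  group "db" {
    count = 1

    task "mongod" {
      driver = "docker"

      config {
        image = "mongo:3.6"
        port_map {
          db = 27017
        }
      }

      resources {
        cpu    = 500
        memory = 1024
        network {
          port "db" {}
        }
      }

      # Register the database in Consul for discovery.
      service {
        name = "mongodb"
        port = "db"
      }
    }
  }
}
```

Scaling is then just a matter of changing `count` and resubmitting the job to the API—no touching individual hosts.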
In addition, it also has the cool batch-job feature that we really love and like a lot more than Chronos, which would be the equivalent in our PanteraS stack. The other thing that’s really cool about it, which came about more as a side effect than intentionally, is that you also get an extra level of isolation between our persistence layer and the front-facing applications. Meaning, it will never, ever happen that, for example, our database instance runs on the same host as an application made by our dev teams, because they’re just 2 different stacks, 2 different sets of hosts. So that will never happen to us. That’s more of a side note, and it’s a really cool feature.
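The batch-job feature mentioned above can be sketched with Nomad’s `periodic` stanza, which covers the Chronos-style scheduled-job use case. Again, the job name, schedule, and image are hypothetical:

```hcl
# Hypothetical periodic batch job, the Nomad equivalent of a Chronos job.
job "nightly-cleanup" {
  datacenters = ["dc1"]
  type        = "batch"

  periodic {
    cron             = "0 3 * * *"  # every night at 03:00
    prohibit_overlap = true         # never run two instances at once
  }

  group "cleanup" {
    task "run" {
      driver = "docker"

      config {
        image = "registry.example.com/cleanup:latest"
      }
    }
  }
}
```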
And that’s what we have in production, non-production, and the dev tenant, and hopefully, we should have that at the end of the demo.
So that’s just part of the wrapping scripts and the makefiles that, at the end of the whole process, spawn 3 tabs with the things that we consider most important for our developers. That’s the Marathon UI, so that they can see, “OK, is something running here, is it working?” When they start deploying applications, they can also see, “OK, is my application running or not?” And then we have the Hashi UI here, so that they can spawn the parts of the persistence layer that they need for the development process on their own—let’s say a MongoDB; they could spawn it up in there.
I told you we were putting everything in containers. “All the things” is a bit of a white lie. We’re really trying to. We’re doing our best to do that, to follow our principles, to benefit from the things that we talked about. At the end of the day, it’s a bit of a white lie because it’s not fully possible for us due to a bit of time constraints and security.
It’s extremely hard for us to stick Vault in containers and run it on top of the schedulers, because we’re using Vault in a very sophisticated way and putting everything in it. We have our secrets in it, our configuration values; we have everything in there. Meaning, if there is no Vault, I basically cannot spawn any other container. That’s the reason why we’re not putting Vault in a container. We’re running it on a separate cluster on the side.
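To show why everything depends on Vault being up first, here’s a sketch of how a Nomad task can pull its secrets from Vault via the `vault` and `template` stanzas. The policy name, secret path, and image are hypothetical; the point is that the task cannot even start until Vault answers:

```hcl
# Hypothetical task fragment: Nomad fetches the secret from Vault
# and renders it into the container's environment before startup.
task "app" {
  driver = "docker"

  config {
    image = "registry.example.com/app:1.0"
  }

  vault {
    policies = ["app-read"]
  }

  template {
    data        = <<EOT
DB_PASSWORD={{ with secret "secret/app/db" }}{{ .Data.password }}{{ end }}
EOT
    destination = "secrets/app.env"
    env         = true
  }
}
```

If Vault is unreachable, the template can’t render and the container never comes up—hence Vault lives outside the schedulers.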
The only other thing that we do not have in a container is everything that’s based on ZooKeeper, because it’s painful. It’s extremely painful for us to run ZooKeeper in containers in the schedulers, because ZooKeeper needs all the nodes of its cluster known to it at startup. Which is fine if you just initially deploy a ZooKeeper cluster inside, say, Nomad. But now let’s say one of the containers crashes. It gets respawned; it lives somewhere else. Then, how do you tell the pre-existing 2 ZooKeeper nodes, “OK, your third cluster member changed”? You put it in a config file. We could have scripted that, but then ZooKeeper needs a restart in order to apply the new config values, and then guess what Nomad would do when I restart ZooKeeper inside the container.
Those are the only 2 things, really.
And now is when the demo was supposed to pop up.
Let’s just imagine it would have worked.
To be honest, at the end of the day, it worked a lot better than I was expecting it to, because as a backup, I had that video screen recorded already, in case things would have really gone bad.
Now, we have a minimum setup: 3 applications deployed. Amongst them are a deployment tool that just gives developers a graphical user interface to deploy their own services, and then they’d be good to go to start the testing.
We’re hiring! So if you think that was cool stuff and you really want to work on it, cool. Come talk to me, work with us. If you think that’s total shit and we should all change everything, also talk to me. We are open for anything.
We’re doing a Q&A on Twitter if you want.
Thank you for coming, thank you for being here, and enjoy the rest of your conference.