How BCG Used the HashiStack to Build a Secure, On-Demand, and Scalable Machine Learning Engine
Dec 19, 2018
Watch this presentation to see BCG's case study using the HashiCorp stack for a scalable, secure AI platform.
Production-grade machine learning is a complex problem, as companies and agencies such as Google, Uber, and DARPA have pointed out publicly.
What may start out as a proof of concept quickly snowballs into a full-blown product, where the data is never controlled or versioned correctly, and the code base quickly becomes a series of glued-together components with no explicit testing.
The problem is further compounded when the underlying infrastructure for training, validating, testing, and outputting the finalized model is neither scalable, repeatable, nor secure. Boston Consulting Group's Gamma X has developed "Source" as an engine to address these problems, with the HashiCorp stack as the underlying technology to ensure that the infrastructure is always scalable, repeatable, and secure, with a zero-to-full deployment in a matter of hours.
Additionally, Source can spin up client cases within 10-12 minutes with the appropriate resources, completely isolated from all other casework. This ensures no sensitive data leaks between teams, clients, or personnel, and lets clients pause their work and later resume right where they left off. This talk is presented by Andrea Gallego and Allen Chen of BCG.
- Andrea Gallego, Gamma CTO and Principal, BCG
- Allen Chen, Principal Architect, Boston Consulting Group
Andrea Gallego: I’m very surprised by the attendance, given that this is probably not the typical consulting-group talk at HashiConf. I will start off by saying Gamma and Source, which is our new software, come from a small but growing practice inside of Boston Consulting Group that is focused only on AI and analytics at scale.
I think that’s a loaded term, with a lot of buzzwords, so we’ll decipher that a little bit as we go along today and discuss how we’ve built our software, how we’ve used basically the entire HashiCorp stack to do that, and then open it up for some Q&A.
What we have discovered over time is that you have data engineers, data scientists, and DevOps people, and while they all want to come to the same end goal, getting there is not so easy.
You have data engineers that want to do their own thing. They want to write their own data pipelines in their own code and their own tool of choice. You have data scientists that want to write in every flavor and every language imaginable: Python, R, TensorFlow, Julia, Rust—name your language, they want to write in it, they want to try it, they want to test. And they should. This is their ability to explore. And then you have the lovely DevOps team, whose responsibility it is to put that all together and push it into production on impossibly short timelines. So, how do you do that?
We took on the task of building a tool, fundamentally deploying AI in containers in a seamless and effective way that allows the data scientists and data engineers to do what they do best: code, code on what feels like a local laptop, use Jupyter and GitHub and Airflow and Luigi—what feels good for them—while underneath is a very well-structured and proprietary stitched-together use of infrastructure to allow us to deploy the productionized applications at scale. When I say at scale, I mean on any infrastructure: cloud, bare metal, you name it—in any time frame, and with multiple model runs. So if you have 2 or 3 containers and you want to run those in parallel, or you want to rerun the model in parallel and deploy that, you can do that as well, all live using our tool.
Introducing the Source AI tool
Today is our proud day of launch. Today Source 1.0 has come out. It is live, and you can see the tool, the demo, and what it’s supposed to do on Source.ai and ask us for more information if you want. We’re excited, finally, after what I think is 10 months, to go live with our version 1.0.
So, what does 1.0 do? We have an analytics lifecycle, and what that means is you have your data coming in and, like we all know, your apps and dashboards coming out. And we all know the data problems happening. We have lots of inconsistent quality and duplication of effort across lots of different roles. Scale and productionization are not always achievable. It takes lots of manual processes to get all that stitching and integration happening.
You may think about analytics at a high level, but if you go underneath that, you have to do all the security, all of the secrets management. You have to do all the logging. You have to do all the single sign-on. You have to make sure that when someone is writing a model, there’s no dead-end code path. You have to make sure there are no passwords being written in plaintext. That’s all managed for you here automatically.
Models are not designed to be structured and scaled; that is the nature of the data science and exploration world. We help with that. And lastly, everything does get released to be available in an app or a dashboard. We are closing that lifecycle, and when you think about the software engineering supply chain, what we are doing is bringing its best practices into AI.
So with that, I’m going to hand it over to Allen, who’s also a principal at the group and our head of product, and he’ll speak about the challenges we faced in doing this with HashiCorp and how we’ve overcome them also with Hashi.
Allen Chen: Awesome. Thank you, Andrea.
Just a quick introduction for myself: My name is Allen. I’m a principal at BCG Gamma, and I’m a member of the Source team.
As you can probably tell from some of the things that Andrea just spoke about, building a robust, scalable tool that spans the entire data science lifecycle exposes us to a pretty large surface area, which has created a lot of interesting technical challenges for us.
3 tasks for HashiCorp products
In addressing some of those challenges, we’ve used HashiCorp products quite significantly in 3 primary ways. The first one is dynamic credentials management. The second is repeatable creation and management of our infrastructure. And the third is portability to other cloud vendors. I’d like to spend a little bit of time today doing a deep dive into the first 2 to share a little more about what our problems were and some of the ways that we’ve chosen to address those.
As a company that deals with many different clients simultaneously, we’re oftentimes exposed to their very sensitive client data, and as a result of that, security is really a paramount concern for us. Among many other things, it’s our duty to 1) make sure that we protect their data, 2) do everything that we can to prevent any data breaches, and 3) comply with any relevant regulatory restrictions.
To ensure that we’re able to meet these requirements, what we’ve done is built Source in such a way that every single client that is running on Source has their own separate cluster to do all of their compute and to store all of their data.
These clusters could all be in the same cloud vendor, separated either at the network level or the security group level, or they can be deployed into an entirely different cloud vendor.
We’re always bringing up new clients, we’re bringing some clients offline, and between all of these clients, we have data scientists that are constantly juggling multiple clients at once.
Even though we want the isolation between the clients, in order for us to create a really smooth and streamlined user experience for the data scientists, what we wanted to do was have all of that access happen from a single, multi-tenant frontend. What this means is that data scientists who can access one or more cases need to be restricted to accessing only the clusters for those clients.
Vault for credentialing
Now, if we were building a system where all of the functionality was available to use just from some basic UI elements, that might be easy. We could do all the access control from the frontend layer. But for data science, we often need to provide these data scientists pretty low-level access.
For example, as you may be familiar, Jupyter Notebooks is a pretty common tool for data scientists to use to build and develop their models. When we give these Jupyter Notebooks to data scientists, we generally like to give them SSH access to these clusters to give them the most power. In order to do that, we could have Source manage a bunch of static credentials on their behalf, but that would put us at risk of credential leakage, and it would also prevent us from ever being able to revoke any cluster credentials.
This is where Vault comes in for us. We’ve leveraged Vault to help manage per-user, per-cluster, unique dynamic credentials.
Before I dive in here a little bit more, it’s probably worth sharing a little bit more about what’s going on under the hood in Source. Each of the clusters has a critical piece in there called Pachyderm. Pachyderm is an open-source library that does 2 core things for us. It is a versioned file system (they call it Git for data science). It is also an engine for running scalable data pipelines.
What this allows us to do is to build lineage and versioning into every model that is being trained and built within Source. Why is that important? Well, in order to build reproducible data science, you must be able to take an output, trace that back through the exact version of code that produced that output, and also trace that back further to the exact version of input data that was fed into that code.
Normally, when you run a piece of code and you have your output artifacts, all that information is lost. With the help of Pachyderm, when you build a model within Source, we’re able to provide that full provenance to you so that anything that you do is fully reproducible.
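The provenance idea Allen describes can be illustrated with a short sketch. This is not Pachyderm's actual API; the types and names below are hypothetical, and it only shows the concept: every output artifact carries pointers to the exact code version and input-data commit that produced it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """A pointer to a specific version of a dataset in a versioned file system."""
    repo: str
    commit_id: str


@dataclass(frozen=True)
class ModelArtifact:
    """An output plus the full lineage needed to reproduce it."""
    output: Commit
    code_version: str   # e.g. a Git SHA of the training code
    input_data: Commit  # the exact data commit fed into that code


def provenance(artifact: ModelArtifact) -> str:
    """Trace an output back to the code and data that produced it."""
    return (f"{artifact.output.repo}@{artifact.output.commit_id} "
            f"<- code {artifact.code_version} "
            f"<- data {artifact.input_data.repo}@{artifact.input_data.commit_id}")


model = ModelArtifact(
    output=Commit("models", "a1b2c3"),
    code_version="9f8e7d",
    input_data=Commit("training-data", "d4e5f6"),
)
print(provenance(model))
```

Because every artifact is immutable and carries its lineage, rerunning the same code version against the same data commit is guaranteed to be working from identical inputs, which is the core of reproducible data science.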
Pachyderm itself has an authentication mechanism to ensure that only the people with granted access can access the file system and the pipelines that are running. We worked with the Pachyderm team to build a Vault plugin that allows us to very easily manage all those credentials in a dynamic way.
As you can see in this diagram, what happens is that when a user wants to access a particular client’s cluster through Source, we go to Vault, grab a token from it, and then we allow that user to use that token. We’ll inject it into any environments that they’re spinning up.
For example, you can spin up a managed Jupyter Notebook from within Source, we’ll inject the dynamic Pachyderm token in there so that they don’t need to worry about how they connect to a cluster and its underlying resources. As long as their session is active, we’ll use Vault to continue to renew those tokens, and then, when their session expires, we’ll revoke those tokens.
What that allows us to do is give users very seamless access without having them deal with their own credentials. And then, when they sign off, their credentials are invalidated, so there’s very little risk of leakage, and, most importantly, we’re able to revoke those credentials when we need to.
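The token lifecycle described above (issue on session start, renew while active, revoke on expiry) can be sketched in a few lines. This is a minimal illustration with an in-memory stand-in for Vault's token endpoints, not Source's or Vault's actual code; all class and method names are hypothetical.

```python
import secrets


class FakeVault:
    """In-memory stand-in for Vault's dynamic-credential endpoints (illustration only)."""

    def __init__(self):
        self._live = set()

    def issue_token(self, user: str, cluster: str) -> str:
        # In the real system this would hit the Pachyderm secrets plugin in Vault.
        token = f"{user}:{cluster}:{secrets.token_hex(4)}"
        self._live.add(token)
        return token

    def renew(self, token: str) -> bool:
        """Renew a lease; returns False if the token has been revoked."""
        return token in self._live

    def revoke(self, token: str) -> None:
        self._live.discard(token)


class Session:
    """Per-user, per-cluster credential lifecycle, as described in the talk."""

    def __init__(self, vault: FakeVault, user: str, cluster: str):
        self.vault = vault
        # The token would be injected into e.g. a managed Jupyter environment.
        self.token = vault.issue_token(user, cluster)

    def heartbeat(self) -> bool:
        # While the session is active, keep renewing the token's lease.
        return self.vault.renew(self.token)

    def close(self) -> None:
        # On session expiry, revoke the token so the credential cannot leak.
        self.vault.revoke(self.token)


vault = FakeVault()
session = Session(vault, "alice", "client-a")
assert session.heartbeat()       # token is live while the session is active
session.close()
assert not session.heartbeat()   # revoked: the leaked credential is useless
```

The design point is that the user never sees or manages a long-lived secret: the credential's lifetime is bound to the session's lifetime.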
Managing infrastructure with Terraform
The next thing I’d like to talk about is how HashiCorp products help us quickly and repeatedly create and manage our infrastructure. As we just spoke about, we have distinct clusters for every client, but there is still a common core shared infrastructure that sits underneath all of those clients’ specific clusters.
The first thing that we do when we’re spinning up a full instance of Source is that we want to get the core infrastructure and the VM images set up, and we want to do that very quickly. What we do here is we use Terraform to set up the infrastructure.
Additionally, we use something called Kops to set up our Kubernetes clusters. If you’re familiar with Kubernetes, you may be familiar with that tool as well.
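As a rough illustration of that per-client isolation in Terraform, the fragment below sketches a reusable module invocation that gives each client its own network. The module path and variable names are hypothetical, not BCG's actual configuration.

```hcl
# Hypothetical sketch: one module instance per client, each in its own VPC,
# so clusters are isolated at the network level.
module "client_cluster" {
  source      = "./modules/client-cluster"
  client_name = var.client_name

  # A distinct CIDR block per client keeps casework network-separated.
  vpc_cidr = var.vpc_cidr
}
```

Encoding the cluster as a module makes bringing a new client online a matter of instantiating the module with new inputs, rather than hand-building infrastructure.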
Packer for cloud portability
In this stage, Packer also plays a pretty critical role for us: we pre-bake all of our AMIs using Packer. We don’t use any configuration management tools like Puppet or Ansible. Instead, we create a separate AMI for every version, and then we use user data within AWS to configure those instances when we deploy.
Another thing to note is that we are encrypting all of our AMIs using KMS in each environment that we deploy into.
And a third point is that Packer makes it really easy for us to port all of this to a different cloud vendor. We’re not locked in to a specific vendor.
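A pre-baked, encrypted AMI build like the one described can be sketched with Packer's `amazon-ebs` builder, which supports `encrypt_boot` and `kms_key_id` options. The source AMI, key alias, and naming below are placeholders, not BCG's actual template.

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-0123456789abcdef0",
    "instance_type": "t2.micro",
    "ssh_username": "ubuntu",
    "ami_name": "source-base-{{timestamp}}",
    "encrypt_boot": true,
    "kms_key_id": "alias/source-ami-key"
  }]
}
```

Because each version gets its own immutable, encrypted image, deploy-time configuration shrinks to whatever can be passed in via user data.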
In the next phase, we use Terraform Enterprise as our state management. In the third column, we then deploy some of our other critical core tools into our infrastructure. Some of these are Hashi products. So we also deploy Vault and Consul, and some third-party tools like Docker DTR for our private Docker registry. And we use CircleCI for our continuous integration and continuous deployment.
And in the last phase of this core deployment, we deploy the actual Source application into the Kubernetes cluster that we created, and then we also set up CircleCI and activate all of the CI/CD pipelines.
Once we have that core infrastructure set up, we can leverage some additional tooling to spin up those client-specific clusters that I spoke to you about earlier. We built something called Source CTL, which is our command-line tool that orchestrates all of those deployments for us.
We run that from within the core infrastructure, and what that does is spin up additional Kubernetes clusters for each client. It’ll deploy Pachyderm. It’ll interface with a few other core components and their APIs to get things set up for a new client.
We will then have a new client cluster up and available within a matter of minutes, and it’s very easy for us to onboard new clients that are completely sandboxed from everybody else.
Vault as a secure metadata store
You’ll notice here that Vault plays another role that is distinct from the one I described earlier. Earlier I was talking about how Vault manages those user-level dynamic credentials to access the client case clusters. We also have a separate shared instance of Vault that sits in the core infrastructure, and what that allows us to do is have a secure metadata store for Source CTL.
In there, we also have a custom-built GitHub plugin that allows us to quickly create GitHub organizations within our enterprise instance for each client. So whenever someone gets a new client cluster, they also get a corresponding GitHub org to keep everything as separated as possible.
By using this separate Vault instance in our core infrastructure that our orchestration tool leverages, we’re able to have a safe, secure metadata store that allows our tool to be idempotent. So if things fail on cluster creation, we can continually retry without worrying that we’re creating a lot of extra infrastructure noise and creating copies of clusters that have previously failed.
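The idempotency pattern described above (check the metadata store before creating, record on success, so retries are safe) can be sketched as follows. The in-memory store stands in for the shared Vault instance, and all names are hypothetical rather than Source CTL's real interface.

```python
class MetadataStore:
    """In-memory stand-in for the shared Vault metadata store (illustration only)."""

    def __init__(self):
        self._kv = {}

    def get(self, key: str):
        return self._kv.get(key)

    def put(self, key: str, value: str) -> None:
        self._kv[key] = value


def ensure_cluster(store: MetadataStore, client: str, create_cluster) -> str:
    """Create a cluster for a client only if none is recorded: safe to retry."""
    existing = store.get(f"clusters/{client}")
    if existing is not None:
        return existing  # a previous run already finished this step
    cluster_id = create_cluster(client)       # the expensive, failure-prone part
    store.put(f"clusters/{client}", cluster_id)  # record success for future retries
    return cluster_id


calls = []


def create_cluster(client: str) -> str:
    calls.append(client)  # track how many times real creation happens
    return f"k8s-{client}"


store = MetadataStore()
first = ensure_cluster(store, "client-a", create_cluster)
second = ensure_cluster(store, "client-a", create_cluster)  # retry is a no-op
assert first == second == "k8s-client-a"
assert calls == ["client-a"]  # the expensive creation ran exactly once
```

Retrying a failed orchestration run then re-executes only the steps that never recorded success, instead of duplicating clusters that already exist.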
So that’s the 2 deep dives that I have for you today.
We wanted to leave it open for you guys to ask some questions about what we’ve done, anything that you’ve seen, or something that you might be curious about that we didn’t speak about.
It’s a real pleasure to share what we’ve been working on with you. We’re super excited about what we’ve built, but also what things HashiCorp products have enabled us to do. It’s still early days for us, but we’d love to engage with you, hear your thoughts, see what your questions are.
Anyone have a question? Feel free to shoot a hand up. And I’ll ask my colleague Bart to come up and help. He knows the system better than I do, so if you stump me, I have backup.
Audience Member 1: Is there anything about Terraform 0.12, or about Vault 1.0 that you’re particularly interested in?
Allen: I have been stumped. Bart?
Bart: No, we haven’t really taken a look at them yet. We are using Terraform Enterprise for now only as state storage, but later on, with Vault, we’re also going to enforce policies on our cloud infrastructure.
Audience Member 2: I wonder how you guys engage with clients. This is a pretty complicated pipeline. Do you bring this pipeline to the client and implement everything for them? Do you work with their IT teams? Do you educate them? How does the collaboration go?
Allen: The way that we see Source is as the hub between 3 very important components. At the top is the client, and at the other 2 points of the triangle are BCG Gamma and BCG traditional.
Typically, when we engage with a client, it’s not just BCG Gamma or just BCG traditional, the strategy consulting part of the house. We see Source as a tool that sits in the middle that allows traditional BCG consultants to work in the system, run models, train models from the UI.
It gives BCG Gamma data scientists the capability to build their models and test their models in the same location. And because of some things like single sign-on and sign-on federation, we’re able to bring on the clients’ data scientists as well as their business operators into the same platform.
The goal is that we can use it internally to deliver a model, but also to have it as a leave-behind for the client. Once we finish our work, they will obviously want to continue iterating on their model. We would like them to continue to do that on Source, and we train them to do so. Part of our goal is to enable them to continue that work so that it becomes a long-term, successful, and viable program for them.
Audience Member 3: Just a quick question. I have some data science background, and I’m trying to compare how Domino Data Science Platform or Dataiku fit into this framework.
Allen: This is obviously a space that people have been thinking very hard about. Domino Data Lab has been around for a while, and they have pretty good support for cloud-based data science tooling. And Dataiku is a tool that we’re very familiar with.
Dataiku, we feel, is a tool that represents the data science lifecycle quite robustly. They have tools from data wrangling and cleaning all the way to model deployment.
The way we see it is that BCG Gamma has a very specific and interesting perspective on the world for data science.
We have the luxury of working with some of the biggest companies in the world, and we have these use cases at our fingertips that give us exposure to the types of scale and problems that a lot of other data science companies are only able to see secondhand through their clients, whereas we see them firsthand.
While Source is in its early days, we’re trying to position ourselves to leverage the exposure and visibility that we have into the types of problems that really big companies have, and to build capabilities around them.
For example, one of our unique characteristics is having that hybrid infrastructure of having the compute sit in distinct clusters while having that nice, seamless UI that’s multi-tenant. That is something that was largely motivated by the requirements of the client engagements that we see, and we see that as our unique differentiator.
Audience Member 4: It looks like you use most of the HashiCorp products except Nomad. Have you tried to use Nomad and encountered any difficulties, or is there any concern about using it?
Allen: Bart, you want to take that one?
Bart: There are no concerns about it, but we didn’t really have anything to use it for. So, we just manage our jobs with Pachyderm, which runs our pipelines, and we just never tried to use it.
Allen: I’ll just quickly add that a fair amount of the team already had pretty deep expertise and knowledge in Kubernetes. That’s served us quite well so far, but we’re quite open to exploring what opportunities Nomad might present for us as well.
Audience Member 5: How do you hydrate the cluster with data, and how do you have the user roles and authorization flow over to the data?
Allen: As you might remember, we use something called Pachyderm in our clusters, and Pachyderm itself has a versioned file system. With that versioned file system, all of the authentication policies travel along with all the files and the datasets that are in there.
And we have tooling so that, if you want to upload a small dataset into your cluster’s versioned file system, you can do it through the browser. That’s largely for exploration and experimentation.
We have tooling that allows you to schedule jobs and connect with third-party data sources, whether it be FTP servers, S3 buckets, Redshift clusters. You can build all of those integrations into Source and have that data pumped into the Pachyderm cluster.
The nice thing about that is, let’s say you schedule it to pull data every single day. Because of the versioning, you can constantly rewrite that same dataset over and over again, but all prior versions are captured, and access to them is controlled based on the user who set up that job.
Does that answer your question, or maybe I missed part of it?
Audience Member 5: No, I think Pachyderm is the thing to look into. We have a similar use case, and I was just wondering how the permissions flow through the entire stack.
Allen: We leverage Vault very significantly in managing all of our credentials, and we use that as our hub to ensure that we do have authentication flowing through our entire stack. Pachyderm is one piece of that.
But it’s a challenging problem, and it’s not one that can just be solved out of the box. It takes some care and caution to make sure that you don’t have any gaps in any of the linkages that you have in the system.
Audience Member 5: Just a follow-up question to your models once you’re deploying. How do you monitor, and how do you make sure that it doesn’t degrade performance, and how do you update? Is that part of Source or is it an extension?
Allen: It’s something that we’re building into Source right now. When you think about data science, building the model is one thing; how that model ultimately gets used is a separate concern. You can have a model that is used in both an API for online predictions, or you could be using that same model in a batch job for batch predictions, or you can embed it into an actual application. How you do that monitoring depends very much on the context in which you’re trying to deploy that model.
We have a concept within Source where we separate the model itself from the shell that you are injecting that model into. The contextual awareness that those distinct shells have allows you to monitor in a way that is appropriate for the use case, because how you monitor a desktop application is obviously going to be very different from how you monitor an API or a batch job.
If there are no more questions, I’d just like to thank you all for your time. I hope you found some of this interesting. Again, you can learn more about Source at Source.ai and we’ll be floating around the conference, so if you see one of us with a BCG badge, just come grab us and we’d be happy to chat with you.
Thank you, all.