Having a research department early in its life helped separate HashiCorp tooling from its many competitors. Learn how your teams can also discover and use academic research to improve your products.
When HashiCorp had fewer than 40 people, we decided to incorporate a research department. Terraform, Vault, Nomad, and Consul were all coded after poring over dozens of CS research papers, and because this approach has been successful in creating uniquely powerful engineering tools, HashiCorp has continued to keep industrial research at its core, working 18-24 months ahead of engineering on novel work that can be incorporated into existing products or used to create new ones.
In the closing keynote from HashiConf EU 2019, HashiCorp's head of research, Jon Currey, walks through HashiCorp's process for discovering and using academic research so that other organizations might learn how to introduce research into their teams' work as well while respecting the business realities of production software development.
Hey, everybody. It’s a real honor to get to close things out here. Obviously, mixed feelings.
I hope you’ve had a good time the last 2 or 3 days. I have absolutely loved hearing all the different experiences from our users, both here on the stage and of course connecting with people in the super-cool hallway track here. What an amazing venue. I thought we were at a food festival yesterday.
There’s a little sadness, but, to quote T.S. Eliot, let’s see if we can go out with a bang and not a whimper.
You don’t have to try really hard to find us from HashiCorp on the internet talking about our use of research. We’re out there. We are talking about and promoting the use of research pretty extensively. And this goes all the way back. On screen now is Armon’s first commit on Serf, back in September of 2013. There are some design documents that you can find internally that show that before this first commit, they had already gone and looked at research.
At this point they decided that they were going to implement SWIM. The SWIM gossip protocol is an underpinning of Serf for failure detection and discovery. But before they got to that point they did a really serious investigation. They looked at a bunch of the other competitive research: Plumtree, T-Man, HyParView.
Actually, you can go back even way before this—the value of research was appreciated. Mitchell and Armon met at the University of Washington when they were both working as undergraduate research assistants in a research lab. They were both separately drawn to the value of research, and that’s how they met.
The University of Washington is very close to Redmond and the headquarters of Microsoft. Microsoft Research really does stand out as an industrial large company really being committed to academic research in a big way. There’s something of a revolving door between the University of Washington and Microsoft Research. You’ll see a lot of professors teaching there. You’ll see a lot of students at all levels, and professors. There’s a steady flow, and people have appointments at both places.
It’s fair to say that HashiCorp loves research. If you go and look through our products’ documentation, you will see it: there are internals pages explaining all the different pieces of research that a particular product or open-source tool relies on.
Interestingly, most of this is 100% available in the open-source version of the tools, because our philosophy is that we give away all the stuff about stability and performance for free. And a lot of the research was needed in order to make sure that we had stable and highly performant systems. So practically all of it, if not all of it, is in the open-source products.
And it’s interesting to see how the use of research matured at HashiCorp. I am the director of research at HashiCorp. I’ve been at HashiCorp for 3 years now, but the use of research in the way I’ve described, as I say, is from day one, and it completely predates the notion of there being a research team.
And it’s interesting to see how the use of research has deepened and matured with each of the projects. In most cases they had a fundamental problem—scheduling, replication, fault tolerance, fault detection, and availability—one fundamental problem that they went to research, and they would evaluate multiple pieces of research and choose one as the base technology.
But once things are deployed, you start to experience things that the academics didn’t see. You start to see different real-world scenarios they couldn’t possibly have tried, and you start to see a larger scale.
SWIM famously was tested on 55 computers. It’s amazing how well it worked. But when you took it to the scale of hundreds and then thousands and then tens of thousands of nodes, it starts to show some flaws that they couldn’t have anticipated at that scale. So over time we go back to the research, we find additional technologies. Initially we sort of look at them and deploy them as is and separately.
But then it gets more complicated. You start to see, “Well, I could use this technology as an offset or a complement, and we can combine these technologies.” And then eventually, you say, “OK, now that we’ve consumed 2 or 3 technologies that are working together in an area, we feel we can start to extend these things.”
After a while you get to the point where you’re kind of doing research. You might not be publishing it, but you are applying the methods of investigation.
So at some point it was time to give back, and that’s when HashiCorp Research came along. We have a charter; we’re an industrial research lab. That means we’re doing the research in and for industry. It’s being applied into open-source tools and commercial products.
But to be research, it has to be novel. We’re doing academic research. I’ll tell you the definition of that in a second. And typically we want to be somewhere like 18 to 24 months out from what engineering would do anyway without our help.
To play, the gold standard is peer-reviewed publication. You want to get your new paper accepted into a conference or a journal or some forum where there’s a peer group. You’re being reviewed by other researchers and there are criteria that are not negotiable:
- Make a novel contribution
- Consider all relevant prior work
- Adhere to the scientific method
You have to make a novel contribution to the literature, same as a PhD. “What are you bringing that’s new? We’re not going to publish it unless you bring something.” Though these days there are experience papers, there are more industry-leaning tracks. But if you’re going in the main academic research track at a conference, you need to bring something new.
You have to have considered all the relevant prior work. Sometimes it can be a gray area what counts as relevant, but usually there’s a consensus in the community about the line of work. This is really the only way the community can make progress: you have to compare yourself to the previous work so that there’s some consistency of evaluation. If we don’t do this, then people are just going off in different directions, and it’s not really an academic community anymore.
And you have to adhere to the scientific method. You have to have a hypothesis, you have to test it, you have to be ready to be wrong. This should be science, computer science.
But doing this in industry adds some additional requirements.
We’re a very small team. We need to be able to keep up with the different areas of academic computer science research that are relevant to the existing tools. Certainly to the existing features of the existing tools, new features of the existing tools, but also potentially whole new categories of tools.
So there’s a lot of ground to cover, and you need to do that in an efficient manner, both in existing and new projects. And we can’t just take the evaluation that some academics gave us and say, “That looks great,” because their evaluation wasn’t at our scale or in our environment. We have to make sure that we evaluate this research against the actual domain, the scale, and all the other requirements that our real users have.
We are a small team and we would like to have effective transfer. One of the big problems you can have with a separate research team is that they don’t necessarily connect well with the product team, and you don’t get a good tech transfer. So we need to make sure that we engage the engineers. We need to hear their requirements. We have to have it so that it’s not some big surprise when we come along with some solution.
I’ve been in industrial research now for over a decade. There are things that work and things that don’t work. And we have a set of approaches, which is what I’d like to tell you about today.
And it’s not glib to call this “research as code.” We have a lot of “x as code” in this community, and it turns out that this fits really well. It’s like with DevOps: Nobody takes the whole of the agile methodology; there are certain core things from DevOps, agile, and lean methodology that we have found work really well. There are practices for doing research in this setting. Having immutable, versioned artifacts is super important. I’m going to show some details about that. And then the whole backlog and iteration method is going to save everybody a lot of pain.
And we want to share the tooling and the processes from engineering by default. If we can’t use their tools, why not? Maybe we should fix or improve their tool rather than going off and having something separate. And again, we want to foster a culture of collaborative consumption of research. Everybody’s in this together. Anyone can bring their pieces of information to the table and we’ll discuss it.
Now I’ll tell you about how we try and make these things happen at HashiCorp.
Research runs on citations; sometimes people say citations, sometimes references. Strictly speaking, the thing on the left of the screen is a citation: a terse mention of some external source, inline in the body of the text. Usually it just points to an entry in the reference section. That entry may include a URL, but really it’s supposed to be the “forever” way of finding that source: title, authors, the venue of publication, and so forth.
And this is the power of academic research: people cite all of the previous relevant work, whether you call those links “citations” or “references.”
On the screen now is a citation graph. It’s like a web graph, except the nodes are the papers instead of webpages, and the edges are citations/references instead of URLs.
It’s a directed graph: if paper A cites paper B, there’s an edge from A to B. And it’s mostly acyclic. It can take months between a paper being submitted to a conference and it being accepted. If some new relevant research comes out during that time, and it doesn’t completely invalidate your conclusions, the right thing to do is update your paper and reference it.
So you can have an interesting situation where a paper references something published at approximately the same time, and even where the 2 papers reference one another. It’s rare, and it’s not going to be a big problem unless you’re doing graph analysis that assumes an acyclic graph, but occasionally it happens.
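To make the structure concrete, here is a minimal sketch of a citation graph in Python. The paper identifiers are hypothetical toy data; the sketch inverts the graph to answer “who cites this paper?” and runs a depth-first search for the rare cycle described above.

```python
from collections import defaultdict

# Toy citation graph; paper IDs are hypothetical. An edge A -> B
# means "paper A cites paper B."
citations = {
    "swim-2002": ["gossip-1987"],
    "lifeguard-2018": ["swim-2002", "serf-docs"],
    "serf-docs": ["swim-2002"],
}

def cited_by(graph):
    """Invert the graph to answer: for each paper, who cites it?"""
    inverted = defaultdict(list)
    for citing, cited_list in graph.items():
        for cited in cited_list:
            inverted[cited].append(citing)
    return inverted

def has_cycle(graph):
    """Depth-first search for a cycle; mutual citations between
    papers published around the same time would show up here."""
    visited, in_stack = set(), set()

    def visit(node):
        if node in in_stack:
            return True          # back edge: a cycle
        if node in visited:
            return False
        visited.add(node)
        in_stack.add(node)
        if any(visit(n) for n in graph.get(node, [])):
            return True
        in_stack.discard(node)
        return False

    return any(visit(n) for n in list(graph))

print(cited_by(citations)["swim-2002"])  # → ['lifeguard-2018', 'serf-docs']
print(has_cycle(citations))              # → False
```

Graph analyses that assume a DAG (topological ordering, influence ranking) break on exactly those mutual-citation edges, which is why it pays to check.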
And people who care deeply about this kind of thing say things like, “This is one of the most important intellectual achievements of humanity.”
And it really is, because this thing, it transcends time and space, language, culture. If we buy into this model, it’s a way for us to advance as a species. It’s kind of cool.
Amazing, then, that this citation graph was only recognized in 1965. The publication model had been going since the late 1800s, but it took a history-of-science professor in 1965 to dig into it and go, “Look at this. If I schlep and do all this analysis by hand, I can find all these references between these things. And there’s some kind of graph here, and it has some kind of power.”
Of course, this was made possible because the first index of academic papers had been assembled by information-retrieval librarians in 1961.
This was a very manual process. Fast-forward to today: there are companies that are very good at indexing things at scale, and this stuff has taken off. A lot of the ideas that Google applies at web scale came from information-retrieval people who’d cut their teeth on this corpus, and on legal and medical documents. The academic citation graph is one of the originators of the techniques that fed into Google being able to do this for the web.
So how do you make use of this crazy graph? Well, you should use a healthy mix, a diverse variety of methods to access and cover the graph, to try and unearth the things that would be useful in your particular context. There’s a whole bunch of different techniques here. I have not got time to talk to you about them because this is a short talk.
On the screen right now you can see a little repository that we’re going to start to use to share resources that we hope are useful to you. It’s not particularly prescriptive, but we’re starting to compile some basic stuff you should be aware of and some smart people who’ve talked about this stuff in the past.
The version that’s going to get pushed out today covers these topics:
- How to read and evaluate a paper
- Tracking and making sense of it all
More is going to get added, and pull requests are greatly appreciated as well. We’d love this to be a repository of shared knowledge.
One of the things that is mentioned in there that I’d really call your attention to is open access. With digitization, the cost of production has gone down. Just like with many other industries, there are institutions that are predicated on charging people lots of money for access to information in this area.
We have subscriptions to the professional bodies like IEEE and the ACM. Even they, over time, are becoming more open with access to information. But that’s kind of the extent. HashiCorp is very committed to open access. There are some links on the repo; go read about this stuff and educate yourself, form your own opinion. But we believe open access is really the right way to go.
Once you’ve found these things, what do you do with them? In our case, we apply a little agile methodology. For each project, we have a set of sources that we think are going to be relevant feeds for relevant papers.
And anyone who wants to can say, “Hey, I’m going to go read all of that conference from that year. I’m going to comb those papers and throw the relevant ones on the backlog.” Then another time someone says, “I’ve got that train journey home tonight, 2 hours to kill. Let me pick a good paper off the top of the backlog, process it, and feed it in.” And so we build our own internal related-work survey for this project.
We found that Google Docs is perfectly adequate for this. On screen is the top of the document where we list the sources and we also track, “OK, someone has processed that year; you don’t need to do that.” We iterate through crawling the frontier of this citation graph, forum by forum.
And then, if we are doing our job, we’re keeping up with the backlog, so the backlog is short. But somebody would grab that off of the backlog, and they would read that paper, and then they would find a place for it in our taxonomy of topics.
And if there isn’t the right topic, they could add a topic, or they could discuss it with other people. But the idea is to incrementally, iteratively create knowledge. These are papers that somebody who gets what the agenda of this project is has looked at, and they’ve pulled out maybe the few critical points to save everybody having to read the same paper fully.
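As a toy illustration of that loop (the titles, topics, and summaries here are made up, and in practice this lives in a shared doc, not code):

```python
# Hypothetical model of the paper backlog and topic taxonomy;
# titles, topics, and summaries are illustrative only.
backlog = ["SWIM (2002)", "Lifeguard (2018)", "Plumtree (2007)"]
taxonomy = {"failure detection": [], "broadcast": []}

def process_next(backlog, taxonomy, topic, summary):
    """Pop the top paper off the backlog, read it, and file a short
    summary under a topic, adding the topic if it doesn't exist yet."""
    paper = backlog.pop(0)
    taxonomy.setdefault(topic, []).append((paper, summary))
    return paper

# Someone with a train journey to kill picks up the top paper:
process_next(backlog, taxonomy, "failure detection",
             "Gossip-based membership; famously tested on 55 nodes.")
```

The point of the structure is exactly what the talk describes: each paper is a small, independent task, and the summaries accumulate into a shared related-work survey.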
And this is great, right? Because we’ve decomposed a complex problem, which is exactly what we do in our engineering work, and it carries over to research. We’ve broken the problem down into small, achievable tasks. It means you can chip away at it, you’re more likely to start and keep going, and you can bring people in from across the organization: “Hey, I’m not sure how to review a paper.” “OK, let’s do it together.” Over Zoom. And then, “OK, you write it up.” Bam. Now you can review papers.
It’s not enough to look at the way the paper has framed and evaluated the problem, and at its benchmark. That’s interesting, but we need to make sure it lines up with what we know about the way our users use our systems.
Again, we want to have a successful transfer to the engineering team. We have a great software-design workflow at HashiCorp. It has been evolving, and it will continue to, but it started out with just the high-level design document; the product requirements piece got added later.
But at this point, now that we have a larger organization and a more mature process, we have product managers and people who go and engage through sales and through the open-source community. We hear a lot of requirements, and we triage and make sense of those.
When we have a new product or a new feature, we write a PRD that captures different scenarios and use cases: What could we build that would hit a number of different high-value features for different populations of users? How should we go forward with this next iteration of this product, or with this new product?
And then that gets handed off to engineering, who write the RFC: given these requirements, here’s the high-level design, and so on. We don’t go blow by blow, but we include what the API is going to look like and what we think we’re doing in terms of technology and semantics.
Again, we have these for the big research projects like Vault Advisor. We went off, we wrote a PRD, we looked at the PRDs from the actual products from Vault, but then we also consulted with people and we wrote a PRD for the research project.
And then Advisor was big enough that we followed a pattern Consul Connect kicked off: one umbrella RFC that frames the high level and connects back to the PRD and its requirements.
But we don’t want each of these documents to be much over 5 to 20 pages; 5 to 10 is the sweet spot. We decompose the problem and write a series of RFCs, and we even do that in research. And we found that really helpful.
We weren’t doing that until Consul Connect started doing that. We were 3 months into the Advisor project, and we said, “Yeah, let’s do that.” And it really helped us to clarify what our objectives were and how we were going to get there with Advisor.
However, HashiCorp Research has existed for only 3 of HashiCorp’s 6 or 7 years, and before there was a research team, people were already consuming research. So you will see some older RFCs that do this, and we still encourage it. There’s nothing wrong with the product group saying, “We know that there’s this research.”
You’ll even find there were backlogs. And when they process a paper, they will say, “What are its features? What are the pros and cons of those features?”
And this is a lot like doing market research or competitive analysis. You may even have one of those matrices with checkmarks: “This one has this feature; that one has that one.” Each product or paper might use different terminology, so let’s map it onto something unified so that we can really do an apples-to-apples comparison.
It’s like competitive analysis, and it actually could be competitive analysis: if you don’t implement this thing, somebody else might, and it could pop up in a competing solution.
So that’s great for getting requirements. And also for looking at how we’re going to measure things. How do we measure things?
We have reproducible test environments at HashiCorp. This one is a private project; there’s a good chance we’ll open-source it, but it’s only going to help you if you’re benchmarking Consul.
The key thing is we built this stack in a very straightforward way. We use Packer to build AMIs (Amazon Machine Images). We use Terraform to bring up a metrics server and lay down the foundation of the networking and everything. And then, on top of that, each developer can, in the morning or whenever they need to, spin up a Nomad cluster. In this case you run a job, and the job brings up a Consul cluster and runs the workload against it. Then we tear down the Consul cluster. We also run it in CI.
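To make that stack concrete, here is a hedged sketch of what such a Nomad job file could look like. This is not HashiCorp’s actual tooling: the job name, Docker image, and the `bench-client` workload generator are all illustrative.

```hcl
# Illustrative only: a Nomad job that stands up a small Consul
# cluster and runs a synthetic workload against it; the whole job
# is torn down when the benchmark finishes.
job "consul-bench" {
  datacenters = ["dc1"]

  group "cluster" {
    count = 3

    task "consul-server" {
      driver = "docker"
      config {
        image = "consul:1.5.1"
        args  = ["agent", "-server", "-bootstrap-expect=3"]
      }
    }
  }

  group "load" {
    task "workload" {
      driver = "exec"
      config {
        command = "bench-client"  # hypothetical workload generator
        args    = ["-writes=1000", "-churn"]
      }
    }
  }
}
```

In a setup like this, Packer builds the AMIs the nodes boot from, Terraform lays down the network and the metrics server, and `nomad job run` / `nomad job stop` bracket each benchmark run.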
You probably have tools like this, but when you start doing research, the key point is to make sure, as much as possible, that you’re using the tools that you’re already using for this kind of thing for your production engineering and the actual workloads.
When we run those jobs, we spin up a Consul cluster with its agents, and we have this Consul live-client tool. Its commands represent different ways of generating synthetic workloads. They’ve come about from conversations with different customers and users, and from looking at the different aspects of the problem.
You can compose these things and run different numbers of instances to try and stress-test, induce different scenarios that are interesting to examine.
This is agile development integrating nicely with research. Anyone can contribute, anyone can use, and anyone can review. A sales engineer might be doing a stress test situation with a prospective customer. Or a solutions architect might be working with an existing customer. Anybody can come along and say, “Hey, I ran that scenario, but it didn’t match up with what we were seeing in Grafana or Datadog.”
Or, if they know what to do, they can go upgrade the tool, they can add a command, they can add a new use of the command. But now we have the immutable artifact, the versioned artifact, and this becomes the topic of discussion.
We immediately go in there and we see all of the work that was already done in this sphere by all of our sibling teams across the organization.
And it goes both ways. When we read research, we might say, “Huh, there’s this metric that we’ve not been looking at.” Or, “There’s a methodology: how many times you run this thing, whether you’ve warmed the cache before you start,” and of course actual workloads, any of these things. Now when we read a piece of research, even if we don’t take the algorithm, we can take the methods from the evaluation.
The things I’ve already told you about do a great job of enabling this cross-team collaboration on research. But there’s more. We use Slack: we have a channel where anyone can come and talk research. We happen to have a research team, but the team is small and we get lonely, so we welcome people coming in and talking.
The whole idea here is that there are no dumb questions. If you’re not sure about something, if you saw something and you’re curious, post the link, ask about it. I mean, Paul Banks is coming in asking excellent questions. He’s the lead of Consul and he’s an awesome research collaborator.
So we have a very vibrant discussion, but it’s completely nonjudgmental, and it’s great because you have domain experts, someone that knows a lot about replication or consistency, or how to do synthetic benchmarks.
It turns out you’ve got people across your organization who’ve maybe got pieces of the answer as you try to synthesize your own research methodology and just generally use research to up the game of the whole technical organization.
We love Papers We Love. We’re active participants. We give talks there. We sponsor their conference.
The Association for Computing Machinery is one of the heavyweights. The ACM has a preferred employer program, and through that program we will pay the membership of any employee who wants to join; we get a good discount, and members get access to the digital library. It’s a win-win.
USENIX has a comparable program.
IEEE is a little bit more conservative, more traditional, but we have a membership. They have local chapters. You can go to local chapters and connect with people over different topics.
Internally, we do brown bags and lightning talks. Lightning talks are a newer thing. They’re really cool because they’re short; they can be embedded in an engineering all-hands, for example. So someone comes along and says, “Hey, I was using elliptic curves inside of this, and they’re super awesome, but it turns out they’re not as difficult as you might think. I’m going to show you elliptic curves in 10 slides.” That’s a nice thing.
And that again lowers the barrier to entry. People now, when they see a paper, can get past the buzzwords and they can immediately start to be a collaborator in processing research.
Obviously, we’ve been doing this for some time, and we’ve been doing it since before we had a research team. You can do this too. And I implore you to, because having research in the DNA of HashiCorp from the start really is part of the secret sauce.
You don’t have to have a dedicated team. If you apply these techniques, if you connect to the artifacts and the processes that you already have, and if you create an open and collaborative culture of research consumption, you’ll be surprised how far you can get with this stuff while doing your day job.
Why would you bother? It’s like this: People tell you before you go to a conference, “Tell your boss why you should go to the conference.” Well, flip that around. You are at this conference, and now you’re going to go back and tell your boss why you should all start doing some research, or at least consuming research.
Yes, it’s going to give you state-of-the-art algorithms. But maybe the product has flown the coop, so you’re not looking for a new algorithm, thank you very much. Well, you might change your mind down the road. There’s V2, V7, or whatever.
But also, even if you don’t take the algorithm, what about this methodology of the evaluation? What about the metrics, the workload? Are you testing your product the right way?
Talent: You can bring in interns and full-time people based on use of research. We’ll get into that in a second. And hopefully you’re in a place where people will feel happy about the fact that you are looking at and consuming research.
Papers We Love is awesome. They have a GitHub repo: a curated collection of lots of papers that people have brought. People go to the meetup and talk about their favorite paper. You can have good Q&As. You can connect with other people who are curious about research.
They put a lot of those talks on YouTube, and they have a conference.
There are many ways to access Papers We Love. And Adrian Colyer has been in the industry for many years and takes it upon himself to write The Morning Paper. It’ll drop in your inbox with a nice write-up of a highly relevant systems paper, say, if you’re on the systems operations engineering side of things.
At work, try the things I’ve mentioned: a Slack channel, a reading group, brown bags. And then: is there anyone at work who has done some research? They may not have finished it, but they could have started.
What about the data science team? Those guys are actually living it. They are having to keep up with machine-learning research and they’re trying to build a continuous delivery, applied research platform. Talk to them; build a bridge.
You can get involved out there as well. You can email, you can tweet. A lot of professors and students are tweeting these days. If you have a program of professional development, go to an academic conference. You don’t have to go to an industry conference.
Blog or tweet about your problems, and they will come to you. One of the ways they will come to you is as a Ph.D. intern, because there’s a wonderful fit here: if you’ve got a real problem that needs solving, or real data, that is exactly what that intern and their professor need to make their research more relevant. So connect with them.
Don’t do this, though: don’t harass them over the internet. Email them privately, or go to the conference and talk to them at their poster or after they’ve presented their paper. It’s very easy to find interns who could have a great impact on your product. You own the intellectual property of anything they do while they’re with you. It gets a bit more complicated when they go back to school, but if you don’t care about patents, that’s not a problem; otherwise, get the patents sorted before the internship ends.
Don’t forget that researchers are people, too. There’s this perception that “I’m outside of the ivory tower; they wouldn’t possibly want to hear from me.” But you don’t have to dig deep to find them on Twitter saying, “I wish more people would talk to me about my research.”
There are researchers standing by and ready to take your call.
And with that, we are putting this repo live. It’s just beginning to get going now, and I will be PR’ing more stuff myself. I encourage you to read it, email me privately, or send me a PR if you have things you think should be in there to encourage this kind of process. And I’d love to talk to everybody about this afterwards as well.