Controlling Thousands of Machines. Aka, My Day Job.

I promised my friend Toby that I’d link to one of his blog posts. Toby is one of the SREs that I work with at twitter, and let me tell you, you always want to stay on the good side of the SREs!

And to introduce it, I actually get a chance to talk about what I really do!

Since last July, I’ve been working for twitter. I haven’t yet gotten to talk about what it is that I do at twitter. For obvious reasons, I think it’s absolutely fascinating. And since we just recently released it as open-source, there’s no reason to keep it secret anymore. I work on something called Aurora, which is a framework for Mesos.

Let’s start with Mesos.

When you run a web application like Twitter, you can’t run your application on a single computer. There’s a bunch of reasons for that, but the biggest one is that there’s not a computer in the world that’s big enough to do it; and if there was, depending on one epically massive computer would be incredibly unreliable. Instead, what we do is set up a data center, and fill it up with thousands upon thousands of cheap machines. Then we just use all of those thousands of little computers. These days, everyone does things that way. Twitter, Google, Facebook, Amazon, Microsoft, Apple – we all run on gigantic clusters of cheap machines in a datacenter.

But there’s a problem there. How do you use a thousand computers? Much less 10,000 or a million, or more?

What you do is build a substrate, called a cluster management system. You write a program, and you say that you want to run 1,000 copies of it, and hand it to the cluster manager substrate. The substrate takes care of figuring out where to put those processes, and telling the specific individual machines that it selects what to do to run your program.

One of the hardest problems in a cluster management system is figuring out how to assign programs to machines. Depending on what a particular program needs to do, its requirements can vary enormously. Running a quick map-reduce has one set of requirements. Running a reliable service – a service that can survive if part of a datacenter loses electricity – has a different set of requirements.

One approach – the one used by Google when I worked there – was to define a constraint language that ultimately had hundreds of possible parameters, and then treat it as a classical optimization problem, trying to optimize over all of it. The good side of that is that it meant that every program at Google could express its requirements exactly. The bad side of it was that the constraint system was incredibly complicated, and almost no one could predict what results a given set of constraints would produce. Configuration turned into a sort of game, where you’d make a guess, look at what the borg gave you, and then modify your constraint – repeat until you got something satisfactory.

mesos_logo At Twitter, Mesos is our cluster management system. It takes a completely different approach. Basically, it takes the borg-like approach and turns it upside down. Mesos itself doesn’t try to do constraint satisfaction at all. What it does is talk to components, called frameworks, and it offers them resources. It basically says “Here’s all of the possible ways I can give you a fair share of resources in the clusters. Tell me which ones you want.”.

Each framework, in turn, figures out how to allocate the resources offered to it by Mesos to individual jobs. In some ways, that might seem like it’s just deferring the problem: don’t the frameworks end up needing to do the same allocation nightmare? But in fact, what you do is write a framework for a particular kind of job. A framework needs to work out some constraint satisfaction – but it’s a much simpler problem, because instead of needing to satisfy every possible customer with one way of expressing constraints, each framework can decide, on its own, how to allocate resources to one particular kind of job. Map-reduce jobs get one framework that knows how to do a good job scheduling map-reduces. Services get a framework that knows how to do a good job allocating resources for services. Databases get a framework that knows how to do a good job allocating resources for storage and bandwidth intensive processes. As a result, the basic Mesos cluster manager is dramatically simpler than a one-size-fits-all scheduler, and each of the component frameworks is also much simpler.

You can see a really cute sort-of demonstration of how Mesos works by looking at @MesosMaster and @MesosSlave on twitter.

I don’t work on Mesos.


I work on Aurora. Aurora is a framework that runs under Mesos. It’s specialized for running services. Services are basically little server processes that run a lot of identical copiesfor a long time. Things like web-servers – where you need a ton of them to support millions of users. Or things like common processes that live behind the web-server answering requests for particular kinds of information.

At Twitter, all of the services that make up our system run in our datacenters on top of Mesos. They’re scheduled by Aurora.

With Aurora, running a service is incredibly easy. You write a configuration:

my_server_task = SequentialTask(
  processes = [
    Process(name='stage_binary',
        cmdline='hadoopfs-copy /glorp/foo/hello_service .'),
    Process(name='run_service', cmdline='./myservice')
  ],
  resources = Resources(cpu=1.0, ram=128*MB, disk=128*MB))

jobs = [
    Service(task = my_server_task, 
        cluster = 'mycluster',
        role='markcc_service',
        environment = 'prod',
        name = 'hello',
        instances=300)]

The configuration says how to install a program and then run it on a machine, once Aurora assigns a machine to it. Aurora will find 300 machines, each of which can dedicate one full CPU, 128MB of memory, and 128MB of disk to the process, and it will run the service on those 300 machines.

Both Aurora and Mesos are open-source, under the Apache license. We’ve got a test environment in the distribution, so that you could be running tests in a virtual Mesos/Aurora cluster one hour from now! And my good friend Toby wrote an excellent blog post on how to set up a working Mesos/Aurora cluster.

I don’t want to toot my own horn too much, but I’m incredibly proud that Twitter open-sourced this stuff. Most companies consider their cluster management systems to be super-proprietary secrets. But at Twitter, we decided that this is a problem that no one should have to solve from scratch. It’s something we all know how to do, and it’s time that people stop being forced to waste time reinventing the same old wheel. So we’re giving it away, to anyone who wants to use it. That’s pretty damned awesome, and I’m proud to be a part of it!

3 thoughts on “Controlling Thousands of Machines. Aka, My Day Job.”

  1. So what’s wrong with the Google solution? Is it just that it’s very hard to state the constraints in such a way that the program doesn’t find an efficient and useless “solution”? (I read something about how George Dantzig, the inventor of the simplex method, tried to use linear programming to create a diet in the 1950s; it give him solutions involving hundreds of gallons of vinegar or hundreds of bullion cubes before his wife said it was good enough for her to work from.)

    1. I don’t know that I’d say that the Google solution is wrong. It’s just one point on a spectrum.

      When you’re trying to schedule thousands of tasks on 10s of 1000s of machines (which is understanding the complexity of the problem!), it’s hard. There are so many competing issues.

      For scheduling reliable services, you want to try to optimize for reliability. That means that you want to ensure that the processes aren’t too close together. In the datacenter, you don’t want your 500 processes to get allocated to 500 machines in the same rack: then all it would take is one person tripping the wrong circuit breaker, and your entire service is dead! It probably doesn’t matter much what kind of network interface or CPU you’ve got, but getting the distribution right is critical!

      For scheduling mapreduces, you want to try to optimize for locality. You’ve got a ton of processes that are performing computations, and then sending their results off to other processes. The more distributed the machines are, the worse the performance will be. You want the individual machines to be fast, and to have the fastest network connections you can get.

      For scheduling database servers, you don’t really care much about where the servers are on the network relative to the clients, but you really need them to be close the storage units. You probably don’t care too much about the CPU, but you need something with massive IO bandwidth.

      For scheduling machine learning servers, you probably want something with a really fast CPU, local SSDs, and good access to shared storage.

      For workflow graphs (like the Storm framework), there’s another different set of requirements.

      All of these requirements need to get expressed somehow.

      The Google approach is to create one massive constraint system. That works, which is just about the highest praise you can give in this area. Look at Google: It works!

      The Mesos approach is to create a collection of smaller, simpler constraint systems.

      Which is better? I prefer the smaller, simpler approach. But at the moment, we don’t have any objective way of measuring which is “better”. No one who has access to the information about Google’s system and how it performs is allowed to share enough info to do an objective head-to-head comparison. So we’re left speculating and arguing from aesthetics.

  2. I can’t say that I understand the solutions you describe, but I never really thought before about what it means to run a massive Web service like Google search or Twitter on giant datacenter clusters. Interesting to think that when I run a Google search or look at a Twitter feed, there is some particular cheap computer (I guess these all run Linux?) actually running the program and answering my request.

Leave a Reply