Computer Architecture Today

Informing the broad computing community about current activities, advances and future directions in computer architecture.

Machine Learning (ML) training is an important workload in enterprise clusters and cloud data centers today.  Products like virtual assistants, chatbots, and web search which are an integral part of our life now, are empowered by the innovations in ML and AI research. However, building such products requires massive, specialized resources like GPUs to train ML models, which is both a time consuming and resource-hungry process. To accelerate such innovations, enterprises and cloud providers have to build and manage large-scale computing infrastructures with a large fleet of GPU devices to be shared across users. Efficiently utilizing these GPU clusters requires a cluster scheduler that decides how to allocate GPU resources to the many jobs while implementing complex cluster-wide scheduling policies to optimize for objectives such as average job completion times (JCT), or user-level fairness.

This blog post provides an overview of the different facets of ML cluster scheduling optimization, recent research efforts along each dimension, and future research directions.

GPU cluster schedulers

ML training jobs are long running and resource intensive. They heavily rely on powerful, yet expensive accelerators like GPUs for computation. Therefore, it is crucial to efficiently manage the GPU cluster to ensure that jobs are well packed, and resources are not left idle.

Characteristics of ML training jobs

ML training jobs have very different characteristics compared to big data jobs which necessitates specialized scheduling technique and infrastructure. First, unlike big data jobs, the iteration time of ML jobs is predictable; they execute a repetitive set of tasks on the CPU and GPU, making their memory and compute usage across iterations predictable. Second, these long running training jobs have to be gang scheduled; i.e., if a job requests multiple GPUs to run, all the resources have to be allocated together. Third, when ML jobs are run across GPUs, they synchronize weights at regular intervals over the network; therefore scheduling decisions have to be sensitive to the GPU placement for the job, and collocate them when possible. Finally, the extent to which resource allocation and placement affects the throughput of the job, is model dependent, and varies from one model to another. Below, we discuss different dimensions along which GPU cluster scheduler optimizations have focused.

Resource Utilization

One of the primary goals of any scheduler is to maximize resource utilization in the cluster, so as to ensure that the expensive accelerators are not idle. Some of the key features in a GPU scheduler that maximize utilization are the ability to perform checkpointing (suspend and resume jobs), spatial and time sharing GPUs across jobs, and locality-aware placement as explored by prior work like Gandiva and Tiresias. The predictability in memory and compute usage of ML jobs makes it possible to leverage low-cost one-time profiling, and perform intelligent job placements across the cluster .

Recent work like Synergy also show that it is not only GPU utilization that needs to be maximized, but some ML jobs are bottlenecked on auxiliary resources like CPU and memory that are required to perform pre-processing operations. While this problem draws parallels to multi-dimensional resource scheduling, the fact that GPU resources have to be gang scheduled, but the auxiliary resources are fungible, makes this a unique problem to tackle.


A more challenging problem of resource utilization arises when the GPU cluster is heterogeneous, consisting of different generations of GPUs. For a given job, a scheduler must now choose from a wide variety of accelerators to  run the ML job. This is challenging because different models exhibit heterogeneous performance trends across accelerator types due to various architectural differences. In other words, the dollar ($) normalized throughput of jobs varies from one accelerator to the other for different models. Recent schedulers tackle the problem of improving resource efficiency as well as job throughput metrics like completion time or makespan by augmenting scheduling policies to be heterogeneity-aware. Gavel is a recent work that generalizes a wide range of existing scheduling policies, expresses them as optimization problems, and transforms them into heterogeneity-aware versions. Other works like Allox optimize for a single scheduling objective, and tightly couple their scheduling mechanism to that objective.


Shared GPU clusters for ML are increasingly becoming common for multi-tenant workloads due to their operational advantages. Nonetheless, the question that pertains is, how to allocate or share the available (possibly heterogeneous) GPU resources in a fair manner across different users? Themis is a scheduler that uses an auctioning mechanism where jobs bid for certain GPUs and a central arbiter determines the global winning bids to maximize the aggregate improvement in the job metric (finish time fairness) across all jobs. When the cluster is heterogeneous, newer generations of GPUs may face higher demand than the older hardware. To tackle such cases, Gandiva-Fair proposes a resource trading mechanism to incentivize users to use the older GPUs thereby improving resource utilization, while ensuring fairness. 


Typically, ML training jobs are assumed to have a fixed GPU requirement throughout its lifetime. However, dynamically reassigning the number of GPUs allocated to a job can improve overall cluster utilization as well as individual job metrics by shrinking jobs when the demand is high, and growing them during off peak loads. However, ensuring such elasticity alongside high resource utilization, while maintaining the statistical efficiency of the model is challenging. Hyperparameters of a training job such as the batch size and learning rate are carefully chosen for a certain GPU type and number. Altering this can affect job convergence. Recent work like Pollux provides a framework to jointly tune batch size and learning rate for elastic scheduling by observing the job’s throughput and statistical behavior during training, and using this information to model the job’s performance for different parameters.  More recently, Microsoft announced Singularity, its planet-scale scheduling infrastructure that can transparently preempt and elastically scale deep learning workloads. Unlike Pollux,  Singularity does not fiddle with the job’s hyperparameters. Its transparent checkpointing strategy and replica splicing allows it to maintain the same world size as a job grows or shrinks, thereby keeping the statistical efficiency intact.

Future directions

With the emerging trends and advancements in ML, massively growing datasets and models, availability of several kinds of heterogeneous accelerators,  and constantly moving hardware landscape, there are several exciting research problems to tackle in the space of scheduling ML jobs. A recent workload analysis from a production cluster at Alibaba shows that the increasing heterogeneity in workload and accelerators makes it challenging to achieve high cluster utilization.

Going forward, as GPU accelerators are becoming more powerful, it is crucial to dynamically share the GPU cores across different jobs, as not all jobs will be able to fully utilize the GPUs. The fact that there is so much heterogeneity in the workload, and varying compute complexity and memory requirements across jobs, makes it necessary for the scheduler to be workload-aware and enable fine-grained sharing of GPU memory and compute across multiple jobs.

Disaggregating resources on a monolithic machine (GPU, CPU, and memory) for better utilization of individual resources is an interesting future direction especially because some workloads could be predominantly CPU bound (due to heavy online pre-processing), or memory bound (due to large datasets, and slow storage). However, addressing the communication overhead that such disaggregation introduces is an important problem to solve. Recent efforts at Meta and Google’s Tensorflow explore the feasibility of such disaggregation in the context of CPU requirements of the input data pipeline.

About the Author: Jayashree Mohan is a Senior Researcher in the Systems group at Microsoft Research Lab, India. Her research interests are broadly in computer systems, storage, and systems infrastructure for machine learning applications.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.