Machine Learning (ML), specifically Deep Neural Networks (DNNs), is stressing storage systems in new ways, moving the training bottleneck from the actual learning phase to the data ingestion phase. Training these models is data-hungry, resource-intensive, and time-consuming. It uses all of the resources in a server: the storage, DRAM, and CPU fetch, cache, and pre-process the dataset (collectively called the input data pipeline) to feed the GPUs that perform computation on the transformed data. Additionally, there is an output data pipeline that periodically checkpoints the model state to persistent storage.
This blog post provides an overview of why an efficient ML data pipeline is critical to speeding up ML training, recent research efforts in this space, and future research directions. Prof. Klimovic discussed data storage and preprocessing for ML in an earlier blog post; this post goes into more detail on the same topic.
Importance of ML data pipeline: Faster accelerators and growing dataset sizes
GPU compute speeds today are growing at an unprecedented rate. For instance, the latest NVIDIA A100's Tensor Cores provide up to 20x higher performance than the prior generation. Similarly, open-source datasets are exploding in size: in contrast to the popular 140 GB ImageNet-1K dataset, the YouTube-8M dataset used in video models is about 1.53 TB for frame-level features alone, while the Google OpenImages dataset is about 18 TB. This growth in computational power and dataset sizes is definitely a boon for DNN training; but not until we have efficient infrastructure to utilize them!
Having all the compute and data in the world is worthless if we cannot deliver data to the compute units at a comparable rate. The input data pipeline for DNN training today cannot keep up with the speed of GPU computation, leaving the expensive accelerator devices stalled for data. A recent study of millions of ML training workloads at Google shows that jobs spend on average 30% of their training time in the input data pipeline. While several research efforts in the past have focused on accelerating ML training by optimizing GPU primitives, many challenges remain in optimizing the various parts of the data pipeline.
A typical ML input pipeline looks as follows. First, raw, compressed training data is fetched off remote (Azure Blob or S3) or local (SSD/HDD) storage. It is then decoded and pre-processed on the fly on the CPU using transformations such as crop, flip, etc. Finally, a batch of pre-processed data is copied over to GPU memory for learning. This data pipeline executes in parallel with GPU computation, iteratively, for several passes over the dataset (called epochs).
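The stages above can be sketched as a chain of Python generators. Everything here is a stand-in: the file names, the list-of-ints "pixels", and the reverse-as-flip all substitute for real fetch, decode, and augmentation work.

```python
import random

def fetch(paths):
    # Stage 1: read raw, compressed records off storage (simulated).
    for p in paths:
        yield {"path": p, "raw": b"compressed-bytes"}

def decode_and_preprocess(records):
    # Stage 2: decode and apply random transforms (crop, flip, ...) on the CPU.
    for r in records:
        sample = list(range(8))      # stand-in for decoded pixels
        if random.random() < 0.5:    # stand-in for a random flip
            sample.reverse()
        yield sample

def batch(samples, batch_size):
    # Stage 3: group pre-processed samples into batches for the GPU.
    buf = []
    for s in samples:
        buf.append(s)
        if len(buf) == batch_size:
            yield buf
            buf = []

def epoch(paths, batch_size=4):
    # One pass (epoch) over the dataset through the full input pipeline.
    return list(batch(decode_and_preprocess(fetch(paths)), batch_size))

batches = epoch([f"img_{i}.jpg" for i in range(8)], batch_size=4)
```

In a real framework these stages run concurrently with GPU compute (e.g., via worker processes), rather than sequentially as in this toy version.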
Mitigating the data bottlenecks in ML Training
Addressing I/O bottlenecks. ML training datasets today far exceed the DRAM capacity of a server, creating an I/O bottleneck during training. Recent research shows that some DNNs can spend up to 70% of their epoch training time on blocking I/O despite data prefetching and pipelining. The underlying cause of this problem is twofold: storage bandwidth limitations and inefficient caching. New hardware like NVIDIA's Magnum IO and PureStorage's AIRI is emerging to address this I/O bottleneck with massively parallel storage solutions.
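A minimal sketch of the prefetching mentioned above, using a background thread and a bounded queue: reads overlap with "compute" in the consumer, but when storage bandwidth is the bottleneck the queue runs dry and the consumer still blocks, which is exactly the I/O stall described. All names here are illustrative.

```python
import queue
import threading

def prefetching_loader(read_fn, keys, depth=2):
    # Background thread reads up to `depth` batches ahead of the trainer,
    # so storage latency overlaps with GPU compute.
    q = queue.Queue(maxsize=depth)
    SENTINEL = object()

    def worker():
        for k in keys:
            q.put(read_fn(k))   # blocks if the trainer falls far behind
        q.put(SENTINEL)         # signal end of dataset

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()          # blocks here when storage is too slow
        if item is SENTINEL:
            return
        yield item

loaded = list(prefetching_loader(lambda k: k * 2, [1, 2, 3]))
```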
Fast, expensive storage solutions like those above are not available on commodity servers, which makes it important to optimize DRAM caching to minimize slow disk accesses. Recent works like CoorDL, Quiver, and Hoard propose domain-specific caching techniques to deliver data efficiently while training machine learning models. These systems exploit the predictability of the data access pattern to better cache the input dataset and eliminate expensive local or remote disk accesses in scenarios like single-node training, hyperparameter search, and distributed training.
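To illustrate the kind of domain-specific policy these systems use, here is a hedged sketch of a cache-once, never-evict policy, in the spirit of CoorDL's MinIO cache: since every sample is accessed exactly once per epoch, LRU-style eviction only causes thrashing, so items cached in the first epoch are simply pinned. The class name and miss accounting are illustrative, not any system's actual API.

```python
import random

class EpochAwareCache:
    # Cache-once, never-evict: exploits the fact that each epoch touches
    # every sample exactly once, so evicting to admit new items buys nothing.
    def __init__(self, capacity, fetch_fn):
        self.capacity = capacity
        self.fetch_fn = fetch_fn    # slow disk/remote read on a miss
        self.store = {}
        self.misses = 0

    def get(self, key):
        if key in self.store:
            return self.store[key]
        self.misses += 1
        val = self.fetch_fn(key)
        if len(self.store) < self.capacity:   # pin; never evict
            self.store[key] = val
        return val

cache = EpochAwareCache(capacity=6, fetch_fn=lambda k: k)
for _ in range(2):                         # two epochs over 10 samples
    for k in random.sample(range(10), 10): # shuffled access each epoch
        cache.get(k)
```

With capacity 6 and 10 samples, epoch one misses all 10 times and pins 6 items; epoch two misses only the 4 uncached items, for 14 misses total regardless of shuffle order.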
ML models must be periodically checkpointed so that they can resume from a correct state after job interruptions. Recent work like CheckFreq explores how to automate the choice of ML checkpointing frequency, while reducing model recovery time and checkpointing overhead.
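A minimal sketch of periodic checkpointing. CheckFreq's actual contribution is choosing the frequency automatically and overlapping the snapshot with training, which this toy loop does not do; all names here are illustrative.

```python
def train_with_checkpoints(num_iters, ckpt_every, save_fn):
    # Snapshot model state every `ckpt_every` iterations so a crashed job
    # resumes from the last checkpoint instead of iteration 0.
    state = {"iter": 0, "weights": 0.0}
    for i in range(1, num_iters + 1):
        state["iter"] = i
        state["weights"] += 0.1      # stand-in for an SGD update
        if i % ckpt_every == 0:
            save_fn(dict(state))     # copy the snapshot, then keep training
    return state

saved = []
final = train_with_checkpoints(num_iters=10, ckpt_every=3,
                               save_fn=saved.append)
```

On a crash after iteration 10, recovery replays at most `ckpt_every - 1` iterations from the last snapshot; a smaller `ckpt_every` shrinks recovery time but pays more checkpointing overhead, which is exactly the tradeoff CheckFreq automates.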
Accelerating pre-processing. Pre-processing data using the CPU is a major bottleneck in several ML models. A recent study of ML training with public datasets revealed that data preprocessing accounts for up to 65% of the epoch time.
As GPUs become more powerful, the CPU-to-GPU ratio even in state-of-the-art ML-optimized servers like the DGX-2 is insufficient to mask the overhead of CPU pre-processing. Therefore, NVIDIA introduced DALI, a GPU-optimized data pre-processing library that can offload part of the pre-processing work to GPUs to overcome the CPU bottleneck.
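The idea of splitting pre-processing between CPU and GPU can be sketched as below. This is not DALI's API; it is a toy pipeline where a cheap per-sample stage stays on the CPU and a heavier per-batch stage is notionally offloaded to the accelerator (in real code the latter would launch CUDA kernels).

```python
def cpu_stage(sample):
    # Lightweight work kept on the CPU (e.g., read + partial decode).
    return sample * 2

def gpu_stage(batch):
    # Heavier transforms offloaded per batch to the accelerator
    # (the DALI idea); here just a stand-in element-wise op.
    return [x + 1 for x in batch]

def hybrid_pipeline(samples, batch_size):
    buf = []
    for s in samples:
        buf.append(cpu_stage(s))
        if len(buf) == batch_size:
            yield gpu_stage(buf)   # batch crosses to the GPU here
            buf = []

gpu_batches = list(hybrid_pipeline([1, 2, 3, 4], batch_size=2))
```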
ML training workloads, especially vision and speech models, employ complex data pre-processing in the critical path, and the randomized transformations make this work unique to each epoch. In common multi-job training scenarios like hyperparameter search, this overhead can be mitigated by carefully synchronizing similar jobs and eliminating redundancy in their data pipelines. Prior work like OneAccess and CoorDL explores this idea.
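A hedged sketch of that redundancy elimination: each raw batch is pre-processed once and broadcast to all concurrent jobs, so the work counter records one preprocessing call per batch rather than one per batch per job. Names here are illustrative, not OneAccess's or CoorDL's actual API.

```python
preprocess_calls = []

def preprocess(raw):
    # Stand-in for expensive decode + augmentation work.
    preprocess_calls.append(raw)
    return raw.upper()

def coordinated_jobs(raw_batches, num_jobs):
    # Preprocess each raw batch exactly once and hand the result to every
    # job, instead of letting each of the num_jobs jobs redo the work.
    per_job = [[] for _ in range(num_jobs)]
    for raw in raw_batches:
        processed = preprocess(raw)
        for job in per_job:
            job.append(processed)
    return per_job

jobs = coordinated_jobs(["a", "b", "c"], num_jobs=4)
```

Four uncoordinated jobs would have paid for 12 preprocessing calls here; the coordinated version pays for 3.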
Model search is another common training scenario that involves exploring several new model architectures and hyperparameters, often requiring hundreds of trials. Systems like HiveMind and Cerebro explore how to parallelize these trials to better exploit the available resources using cross-model operator fusion and new parallel SGD execution techniques like model hopper parallelism.
A team of researchers at Google Brain recently proposed data echoing, a technique that reuses data from prior batches when the accelerator is stalled waiting for the current batch. This approach relaxes the strict data ordering and randomness requirements of SGD, and needs careful selection of the echoing rate, as well as hyperparameter tuning, to achieve state-of-the-art accuracy.
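The core of data echoing is simple to sketch as a generator wrapper around the upstream pipeline; `echo_factor` is the knob that must be tuned against the accuracy cost described above.

```python
def data_echoing(batch_stream, echo_factor):
    # Replay each upstream batch `echo_factor` times, so the accelerator
    # keeps working while the input pipeline produces the next batch.
    # Too large a factor over-reuses data and can hurt final accuracy.
    for b in batch_stream:
        for _ in range(echo_factor):
            yield b

echoed = list(data_echoing(["b0", "b1"], echo_factor=2))
```

In practice echoing is inserted at a chosen point in the pipeline (before or after augmentation/shuffling), and where it sits changes how much randomness is sacrificed.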
Future research directions
With the fast evolution of new ML models, the growing need for large-scale distributed training infrastructure, changing hardware configurations, and emerging storage technologies, there is no dearth of new challenges in this space. Here are some promising future directions.
The compute-storage tradeoff. A common question that arises with respect to pre-processing stalls is: why not cache the intermediate pre-processed state for use in subsequent epochs? The answer is that decoding the compressed data blows up the dataset size by 5-7x. However, with the emergence of fast, affordable storage like NVMe SSDs and persistent memory, it may be worth trading off storage for compute.
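A back-of-the-envelope illustration of that blow-up, with assumed (not measured) numbers for an average ImageNet-style JPEG:

```python
# Illustrative assumptions: an average JPEG of ~110 KB that decodes to a
# ~500x400 RGB image. These are plausible ballpark figures, not
# measurements from any specific dataset.
compressed_kb = 110
w, h, channels = 500, 400, 3
decoded_kb = w * h * channels / 1024   # raw uncompressed RGB bytes

blowup = decoded_kb / compressed_kb    # lands in the 5-7x range above

dataset_gb = 140                       # ImageNet-1K on disk (from the text)
cached_gb = dataset_gb * blowup        # cost of caching decoded samples
```

Caching decoded ImageNet would thus take several hundred GB, infeasible in DRAM but plausible on NVMe or persistent memory, which is what makes this tradeoff newly interesting.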
The cost-performance tradeoff. Recent work reveals that stalls due to a slow data pipeline squander the improved performance of faster, expensive GPUs, resulting in lower value per dollar spent. It may therefore be more cost-effective to train some models on slower, less expensive GPUs, rather than under-utilize faster, expensive ones.
Output data pipeline. Checkpointing model state is extremely important for long-running ML training jobs. However, frequently checkpointing large model state, such as that of language models like GPT-3, can be inefficient due to recurring storage writes and wasteful consumption of storage bandwidth. It may be worthwhile to explore incremental checkpointing approaches that monitor and write only the model state that has changed since the last checkpoint.
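One possible shape of such an approach, sketched with plain dictionaries standing in for model tensors; the function names are hypothetical, not from any existing system.

```python
def incremental_checkpoint(state, last_ckpt):
    # Write only the entries that changed since the last checkpoint,
    # saving storage bandwidth for huge models.
    return {k: v for k, v in state.items() if last_ckpt.get(k) != v}

def restore(base, deltas):
    # Rebuild the latest state by replaying deltas over a full checkpoint.
    out = dict(base)
    for d in deltas:
        out.update(d)
    return out

base = {"layer1": 1.0, "layer2": 2.0}     # last full checkpoint
step = {"layer1": 1.0, "layer2": 2.5}     # only layer2 was updated
delta = incremental_checkpoint(step, last_ckpt=base)
restored = restore(base, [delta])
```

The open questions are the interesting part: detecting changed state cheaply (real tensors change almost everywhere after an SGD step, so "changed" may need to mean "changed significantly"), and bounding recovery time as the delta chain grows.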
About the Authors:
Jayashree Mohan is a graduate student in the Department of Computer Science at the University of Texas at Austin. She is primarily interested in systems for machine learning. Previously, she worked on building tools for file-system reliability.
Vijay Chidambaram is an Assistant Professor in the Department of Computer Science at the University of Texas at Austin. His research group, the UT Systems and Storage Lab, works on all things related to storage.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.