The implications of Machine Learning (ML), be it training or inference serving, have steered systems and architecture research accordingly. A significant amount of work is happening in the Systems for ML space, ranging from building efficient systems for data preprocessing (e.g., Cachew, ATC’22) to automatic and timely (re)generation of accurate models (e.g., MatchMaker, MLSys’22) in the various ecosystems driven by these models. We recently looked at the implications of ML for memory management, in collaboration with Qinzhe Wu (The University of Texas at Austin), Krishna Kavi and Gayatri Mehta (University of North Texas), and Lizy John (The University of Texas at Austin), and reflected on what the next generation of memory allocators should look like. Ruihao Li, whom I co-advise with Prof. Lizy John at UT Austin, will present this vision at HotOS’23 in June 2023. This blog post briefly describes that vision: we make a case for offloading memory allocation (and other similar management functions) from the main processing cores to other processing units to boost performance and reduce energy consumption.
The gap between processor speeds and memory access latencies is an ever-growing impediment to the performant execution of ML tasks. ML accelerators spend only 23% of their overall cycles on computation and more than 50% on data preparation, as reported in the paper “In-Datacenter Performance Analysis of a TPU”, published at ISCA’17. Even in general-purpose warehouse-scale datacenters, 50% of compute cycles are idle waiting for memory accesses, causing performance degradation for the applications.
Recognizing this challenge, multiple memory allocation libraries, e.g., Google’s TCMalloc and Microsoft’s Mimalloc, have been proposed and are widely used. To be useful in real-world use cases, where concurrent memory allocation requests must be supported, these libraries are multi-threaded; they maintain and rely heavily on complex metadata. While this enables memory allocation for concurrent requests, the same core executes both memory management and application code, so the two contend for resources, which ends up hurting overall performance. We ran a few measurement experiments comparing the performance of four memory allocators, PTMalloc2, Jemalloc, TCMalloc (OSDI’21), and Mimalloc, on a set of representative workloads from SPEC CPU2017. We observed that, depending on the allocator used, performance can vary by more than 10x.
So, we dug deeper. Existing memory allocators that are implemented in software achieve higher performance than the default Glibc allocator (PTMalloc2). But, since they ignore the underlying hardware, they still fall short in leveraging system-level optimizations, like reducing cache pollution and TLB misses. Alternatively, to avoid the contention over resources, some proposals implement memory allocation functions on hardware accelerators. However, relying on customized hardware units may keep such designs from becoming general-purpose memory management solutions, or may necessitate frequent changes to the hardware as algorithms evolve.
Next-generation memory allocators will likely continue to be complex. One way to handle their complexity without impacting application performance is to isolate the allocation functions from the rest of the code and give them separate resources. This will prevent allocators from polluting the cache and interfering with the other metadata of applications. However, current allocators cannot easily be “plucked” out of the code and offloaded to dedicated cores, because their metadata is usually tightly coupled to the user data. Thus, the design of next-generation memory allocators requires innovations in both software and hardware, and demands their co-design.
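To make the idea of “plucking out” the allocator a bit more concrete, here is a toy sketch in Python. It is only an illustration, not the design from our HotOS paper: the class `OffloadedAllocator`, its fixed-size blocks, and the request queue are all assumptions made for this example. Python threads also do not map to dedicated cores, so the sketch models only the structure: the free-list metadata is owned by a separate helper thread, while the arena that application code touches holds user data only.

```python
import queue
import threading


class OffloadedAllocator:
    """Toy fixed-size-block allocator whose metadata lives on a helper thread.

    Hypothetical illustration: the helper thread stands in for a dedicated core.
    """

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        # The arena holds user data only; no allocator metadata is stored in it.
        self.arena = bytearray(num_blocks * block_size)
        self._requests = queue.Queue()
        self._worker = threading.Thread(
            target=self._serve, args=(num_blocks,), daemon=True
        )
        self._worker.start()

    def _serve(self, num_blocks):
        # Metadata (the free list) is created and touched only on this thread.
        free_blocks = list(range(num_blocks))
        while True:
            op, arg, reply = self._requests.get()
            if op == "stop":
                reply.put(None)
                return
            if op == "alloc":
                # Reply with a byte offset into the arena, or None if exhausted.
                offset = free_blocks.pop() * self.block_size if free_blocks else None
                reply.put(offset)
            elif op == "free":
                free_blocks.append(arg // self.block_size)
                reply.put(None)

    def _call(self, op, arg=None):
        # Each request carries its own reply channel (the inter-core message).
        reply = queue.Queue(maxsize=1)
        self._requests.put((op, arg, reply))
        return reply.get()

    def alloc(self):
        return self._call("alloc")

    def free(self, offset):
        self._call("free", offset)

    def close(self):
        self._call("stop")
        self._worker.join()


allocator = OffloadedAllocator(num_blocks=4, block_size=64)
first = allocator.alloc()
second = allocator.alloc()
allocator.arena[first:first + 5] = b"hello"  # application writes user data
allocator.free(first)
reused = allocator.alloc()  # the freed block is handed out again
allocator.close()
```

In this sketch every call blocks until the helper replies, so the application still pays a round trip per allocation. A real offloaded design would presumably hide that latency, for example by having the helper hand out blocks ahead of demand.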
To summarize, we came up with an analogy to think about offloading the allocation to a separate dedicated core: just like a baby in a big family, the memory allocator is growing up. The time has come to give it a new room (core) in our house (CPU). This growing child raises multiple new research questions. How do we strike the right trade-off between the overhead (additional inter-core communication) and the benefit (a reduction of cache pollution, plus asynchronous execution) of offloading memory allocators? The choice of the room type (i.e., the type of processing core) is another research question. Should the room be the same as the other rooms (i.e., other CPU cores), or is a small room enough for memory allocation? Can the room be used for other functions as well, instead of exclusively for memory allocation?
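One way to reason about the overhead-versus-benefit question is a back-of-the-envelope cost model. The function below is a hypothetical sketch: its name, its parameters, and the cycle counts in the example are assumptions chosen purely for illustration, not measurements. It assumes that, when allocation runs inline, the application core pays for the allocator's work plus the cache pollution it causes, and that, when allocation is offloaded, the application core pays for communication plus whatever part of the allocator's work it cannot overlap with its own execution.

```python
def offload_gain_cycles(alloc_calls, inline_cycles, pollution_cycles,
                        comm_cycles, overlap_fraction):
    """Estimate application-core cycles saved by offloading allocation.

    A negative result means offloading is a net loss under this model.
    All inputs are per-call averages and are hypothetical.
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be between 0 and 1")
    # Inline: the application core runs the allocator and suffers its pollution.
    inline_cost = alloc_calls * (inline_cycles + pollution_cycles)
    # Offloaded: pay for communication, plus the allocator work that is
    # not hidden by asynchronous execution.
    waited_cycles = (1.0 - overlap_fraction) * inline_cycles
    offloaded_cost = alloc_calls * (comm_cycles + waited_cycles)
    return inline_cost - offloaded_cost


# Made-up numbers: cheap communication makes offloading pay off ...
cheap = offload_gain_cycles(1000, inline_cycles=100, pollution_cycles=50,
                            comm_cycles=80, overlap_fraction=0.5)
# ... while expensive communication turns it into a loss.
costly = offload_gain_cycles(1000, inline_cycles=100, pollution_cycles=50,
                             comm_cycles=200, overlap_fraction=0.5)
```

Under these assumptions, offloading breaks even when the per-call communication cost equals the pollution cost plus the overlapped share of the allocator's work, which is why both cheaper inter-core communication and more asynchrony shift the balance toward offloading.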
About the author: Neeraja J. Yadwadkar is an Assistant Professor in the ECE department at The University of Texas at Austin, and an Affiliated Researcher with VMware Research. Most of her research straddles the boundaries of Computer Systems and ML.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.