As computer architects, one of our key tasks is to propose abstractions that improve system programmability in a manner that stands the test of time. One such abstraction, which has been crucial to the success of computing, is virtual memory. In this blog post, I discuss the challenges facing virtual memory today and some promising ways to address these challenges.
Why is virtual memory important?
Today, computer systems at all scales, from datacenters to wearable devices, depend on virtual memory. Virtual memory gives software developers the illusion that memory is always sufficient and linear, making systems easy to program. The hardware and OS manage the relationship between the virtual memory address space and the physical memory address space. Perhaps the biggest testament to virtual memory’s success is that we don’t even think about it when writing code. And yet, consider what would happen in its absence. Without virtual memory, it would be practically impossible to program modern systems, which have a complex assortment of on- and off-package DRAM devices, hard disks, solid-state disks, and more. Programmers would have to rewrite their code for every change in memory capacity or configuration. Systems would not be able to run multiple applications simultaneously, as applications could overwrite one another’s memory. Malicious programs would be able to corrupt the memory of other programs. In short, virtual memory is fundamental to programmability, code portability, memory protection, system security, and indeed, the very success of computing.
Why is virtual memory under threat?
But virtual memory is also under threat today. Fundamentally, the problem is this: the core virtual memory abstraction was conceived decades ago, and its basic components have remained largely unchanged since. In that time, hardware and software have changed dramatically. Massive mainframes made with discrete electronic components have evolved into systems integrating not just tens or hundreds of CPUs, but also exotic specialized hardware. Hardware accelerators for graphics (i.e., GPUs), video and signal processing, face recognition, deep learning, etc., are becoming de rigueur. Instead of using the timeshared systems that virtual memory was originally conceived on, we now run sophisticated algorithms on vast sets of data, maintain them in big key-value stores, interact with our systems using speech and gestures, and embrace new paradigms like virtual reality. And yet, remarkably, we continue to use traditional virtual memory concepts.
Consequently, virtual memory is now a system performance bottleneck. Consider virtual memory performance for an application that analyzes a large graph. Graph processing often involves chasing pointers over terabytes of data in irregular and unpredictable ways, with poor memory access locality. Poor access locality is known to stress hardware caches, degrading system performance. But recent studies reveal a lesser known but crucial insight: poor access locality also hampers the performance of the key hardware component of virtual memory, the translation lookaside buffer (TLB). TLB misses often consume 20-40% of runtime on modern systems.
This bottleneck presents a serious problem. Processor vendors are already devoting ever-increasing chip area to TLBs to try to improve performance. These structures are now so large (e.g., Intel’s Skylake chip uses 12-way 1536-entry L2 TLBs) and so frequently accessed that they alone can consume 15-20% of chip energy. Furthermore, the virtual memory hardware stack is becoming increasingly complex. Whereas TLBs were traditionally the only hardware component used to accelerate address translation, processor vendors (Intel, AMD, ARM) now implement MMU caches for upper levels of the page table, thereby reducing TLB miss penalties. Similarly, whereas TLB misses were historically served by an OS routine that walked the page table, modern chips implement page table walkers entirely in hardware, obviating the need for slower OS-directed page table lookups. In fact, such is the concern over virtual memory overheads that modern chips are even beginning to implement multithreaded hardware page table walkers that can service more than a single TLB miss concurrently. For example, Intel’s Skylake and AMD’s Zen chips implement 2-4-way hardware page table walkers per core.
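To make the cost concrete, here is a minimal sketch (my own toy model, not vendor code) of the radix page table walk that a hardware walker performs on a TLB miss, in the style of the 4-level x86-64 walk. A Python dict stands in for page table memory; the key point is that each level costs one memory reference, so a single TLB miss can trigger up to four.

```python
# Toy model of a 4-level radix page-table walk (x86-64-style geometry).
# Each level lookup models one memory reference made by the hardware walker.

PAGE_SHIFT = 12          # 4 KB pages
LEVELS = 4               # e.g., PML4 -> PDPT -> PD -> PT
INDEX_BITS = 9           # 512 entries per table

def walk(page_tables, root, vaddr):
    """Translate vaddr to a physical address, touching one table per level.

    page_tables maps (table_id, index) -> next table_id, or a physical
    frame number at the last level.
    """
    table = root
    for level in reversed(range(LEVELS)):
        shift = PAGE_SHIFT + level * INDEX_BITS
        index = (vaddr >> shift) & ((1 << INDEX_BITS) - 1)
        entry = page_tables.get((table, index))
        if entry is None:
            raise KeyError("page fault")   # the OS would handle this case
        table = entry
    frame = table                          # last level holds the frame number
    return (frame << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
```

MMU caches shortcut the upper levels of exactly this loop, and multithreaded walkers simply run several such walks concurrently.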
What is the research community doing about it?
The research community has not remained blind to these problems. In the last few years, several innovative solutions have been studied. For example, a particularly intriguing idea is that of direct segments, first proposed by Basu et al. in their ISCA ’13 paper entitled “Efficient Virtual Memory for Big Memory Servers”. This work goes back to virtual memory basics and asks: what aspects of virtual memory are actually being used by modern workloads? It turns out that for an important and broad class of memory-intensive workloads, there is little paging activity or fine-grained memory protection usage. Furthermore, most memory accesses are to large anonymous regions of allocated memory space. These seemingly simple observations yield a powerful insight: if the OS can provide applications with a direct segment memory abstraction (essentially acting as a more massive and flexible version of a superpage), while retaining the paging abstraction for the remainder of the address space, TLB misses can be reduced dramatically. This brand of hardware-software co-design is an exciting direction, and has been followed up with ideas like range translations, etc. (see “Redundant Memory Mappings for Fast Access to Large Memories” at ISCA ’15).
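The direct segment fast path is simple enough to sketch in a few lines. The following is my own illustrative rendering of the idea (register and parameter names are mine, not the paper's): addresses falling inside the segment translate with a single bounds check and addition, bypassing the TLB entirely, while everything else falls back to conventional paging.

```python
# Illustrative sketch of direct-segment translation (names are mine):
# a contiguous [base, limit) slice of the virtual address space maps to
# physical memory with one fixed offset; no TLB lookup, no page walk.

def translate(vaddr, base, limit, offset, page_walk):
    if base <= vaddr < limit:
        return vaddr + offset      # direct segment: one add, never misses
    return page_walk(vaddr)        # rest of the address space stays paged
```

The appeal is clear from the sketch: for the large anonymous regions that dominate big-memory workloads, translation becomes a comparison and an add, while fine-grained paging survives where it is still needed.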
Looking ahead, upcoming work on “Do-It-Yourself Virtual Memory” by Alam et al., to be presented at ISCA ’17, goes further and designs a mechanism where applications are given complete freedom in choosing mapping schemes for chunks of the virtual memory address space. The basic idea is to decouple virtual-to-physical mappings from access permissions to achieve this. This is a particularly interesting idea, which ultimately raises the question: how successful would application developers be in identifying and using mapping schemes? This seems like a natural question ripe for further exploration.
While approaches that require hardware-software co-design are promising, another parallel approach has been to focus on purely hardware changes. In one flavor of this work, my group has shown that real-world applications and OSes often (though they don’t have to) allocate memory in a manner where tens of contiguous virtual pages are mapped to tens of contiguous physical pages. This enables lightweight TLB coalescing, where a single entry can store information about multiple contiguous mappings (see “CoLT: Coalesced Large-Reach TLBs” at MICRO ’12 and “Efficient Address Translation for Architectures with Multiple Page Sizes” at ASPLOS ’17). Such hardware schemes are easy to implement and require no OS or software changes. Consequently, TLB coalescing schemes are being adopted by industry (e.g., AMD’s Zen chip supports TLB coalescing today). Looking ahead, an upcoming ISCA ’17 paper by Park et al. entitled “Hybrid TLB Coalescing: Improving TLB Translation Coverage Under Diverse Fragmented Memory Allocations” suggests that there is plenty of scope to further exploit such mapping patterns.
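A toy sketch shows why coalescing pays off (this is my own simplification, not the CoLT hardware design): one entry records the first virtual page and physical frame of a contiguous run plus a validity bitmap, so a single tag match can translate several neighboring pages.

```python
# Toy coalesced TLB entry (a simplification of the published designs):
# one entry covers up to 8 contiguous virtual-to-physical mappings,
# with a bitmap marking which pages in the run are actually valid.

class CoalescedEntry:
    def __init__(self, base_vpn, base_pfn, bitmap):
        self.base_vpn = base_vpn   # first virtual page number in the run
        self.base_pfn = base_pfn   # first physical frame number in the run
        self.bitmap = bitmap       # bit i set => vpn base+i -> pfn base+i

    def lookup(self, vpn):
        off = vpn - self.base_vpn
        if 0 <= off < 8 and (self.bitmap >> off) & 1:
            return self.base_pfn + off
        return None                # miss in this entry

# One entry covering 4 contiguous pages starting at vpn 0x100 / pfn 0x40.
entry = CoalescedEntry(base_vpn=0x100, base_pfn=0x40, bitmap=0b00001111)
```

Because contiguity arises naturally from allocator behavior rather than being mandated, the hardware can exploit it opportunistically, which is why no OS changes are required.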
Equally intriguingly, several recent studies suggest that there is performance to be squeezed out of non-traditional TLB designs. Upcoming work by Ryoo et al. at ISCA ’17 entitled “Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB” is an exciting example of this. Conventional wisdom suggests that TLBs must always be designed to be small so that their access latencies are low. However, Ryoo et al.’s work shows that alternate designs are possible for multi-level TLB hierarchies, where it may be beneficial to back the latency-critical L1/L2 TLBs with slower but considerably larger in-DRAM TLBs. In general, the notion of leveraging emerging memory technologies to implement non-traditional address translation techniques is, I believe, an area for rich innovation.
Finally, some recent work from my group also suggests that there may be traditionally overlooked aspects of address translation contributing to performance overheads. Specifically, our “Translation-Triggered Prefetching” paper at ASPLOS ’17 asks: when a memory access prompts a TLB miss, what is the overhead of replaying that access once the TLB miss is handled? While it seems intuitive that page table walks that miss in the cache are almost always followed by cache misses for the replay, replays have traditionally not been optimized. Consequently, we show that we can trigger prefetches of replay data into caches when TLB misses occur, improving system performance.
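The core idea can be sketched in miniature (a hedged toy model, not the paper's hardware): once the page table walk produces the translation, the translated address is immediately used to prefetch the replayed access's cache line, overlapping the data fetch with the tail of miss handling.

```python
# Toy model of translation-triggered prefetching: on a TLB miss, use the
# freshly computed physical address to pull the replay's cache line in
# before the instruction replays. `cache` is a set of cached line numbers.

LINE = 64   # cache line size in bytes

def handle_tlb_miss(vaddr, walk, cache):
    paddr = walk(vaddr)          # page-table walk yields the translation
    cache.add(paddr // LINE)     # prefetch the line the replay will touch
    return paddr                 # replayed access now hits in the cache
```

The insight is that the walk already computes exactly the address the replay will need, so the prefetch costs essentially nothing to initiate.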
Successfully preserving virtual memory will require a combination of these types of approaches. In general, there is evidence that both chip vendors and OS designers are willing to innovate at this layer, as seen in recent implementations of TLB coalescing techniques and updates to the Linux/FreeBSD paging code. This presents both an opportunity and a challenge for budding researchers in computer systems: the changing landscape of hardware and software suggests that virtual memory abstractions are in flux, but also that simple mechanisms to preserve their performance are likely to be useful to real-world systems and products.
About the author: Abhishek Bhattacharjee is an Associate Professor of Computer Science at Rutgers University. His research interests span the hardware-software interface. He is also currently a CV Starr Visiting Fellow at Princeton University’s Neuroscience Institute.