Computer Architecture Today

Informing the broad computing community about current activities, advances and future directions in computer architecture.

This article continues (and concludes) the discussion of the proceedings of DPC-4, covering the remaining four contestants and summarizing the trends observed across all eight prefetchers presented in the championship. As in Part I, we focus on how each prefetch algorithm functions and why it is effective. Finer implementation details can be obtained from the workshop papers or the source code.

BertiGO (Simranjit Singh, University of Murcia; Agustín Navarro Torres, University of Zaragoza; Alberto Ros, University of Murcia)

Motivation

When evaluating the baseline prefetcher configuration, the authors noted that Berti frequently issues redundant prefetch requests for lines that were already prefetched or are already present in the cache. Moreover, using only the PC provides very limited context for pattern recognition, limiting Berti’s prediction capabilities. Pythia, in turn, is found to generate many useless prefetches for some workloads, which pollutes the L2 cache and wastes memory bandwidth.

Idea

  1. A Region-Based Bit-Map Filter is added: a fully associative structure that stores the prefetched and accessed cache lines of each region as a bit-vector. For regions tracked by the filter, if the M-th bit is set, all prefetch requests for the M-th cache line inside the region are dropped (see the first sketch after this list).
  2. In addition to the PC, the authors propose indexing the Berti tables with a hash (shifted XOR) of the last 4 PCs combined with the current PC, adding program-flow context (see the second sketch after this list).
  3. Set-Dueling is added to Pythia: instead of always using the default policy to issue prefetches, 5 different policies are introduced, including a No-Prefetch policy that disables Pythia. All 5 policies compete in a 10M-instruction tournament, at the end of which the policy with the lowest miss rate is chosen for the rest of execution.
  4. An Adaptive Next Line prefetcher (ANeLin) is added at the LLC, which uses a sampling cache to track demand misses and insert next-line prefetches. A heuristic mechanism tracks useful and useless prefetches globally and per-PC; ANeLin is disabled if the ratio of useful to useless prefetches drops below a threshold.
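
To make the filter concrete, here is a minimal C++ sketch of the region bit-map filter. The region size (4 KiB), line size (64 B), and the unbounded map standing in for a small fully associative array are illustrative assumptions, not the authors’ exact parameters.

```cpp
#include <cstdint>
#include <unordered_map>

// Region-based bit-map filter: one 64-bit bitmap per 4 KiB region tracks
// which 64-byte lines were already prefetched or accessed. An unordered_map
// stands in for the small fully associative array a real design would use.
constexpr uint64_t LINE_BITS   = 6;   // 64 B cache lines
constexpr uint64_t REGION_BITS = 12;  // 4 KiB regions => 64 lines per region

struct RegionFilter {
  std::unordered_map<uint64_t, uint64_t> regions;  // region id -> line bitmap

  // Returns true if a prefetch to 'addr' is redundant and should be dropped.
  bool filter(uint64_t addr) {
    uint64_t region = addr >> REGION_BITS;
    uint64_t line   = (addr >> LINE_BITS) & 0x3F;  // line index within region
    uint64_t mask   = 1ULL << line;
    uint64_t &bits  = regions[region];
    if (bits & mask) return true;  // M-th bit set: drop the request
    bits |= mask;                  // record it so later duplicates are dropped
    return false;
  }
};
```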
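
The PC-history index is just as brief to sketch. The per-entry shift amount and the most-recent-first ordering of the history are illustrative assumptions; the paper specifies only a shifted XOR of the last 4 PCs with the current PC.

```cpp
#include <cstdint>

// Shifted-XOR hash folding the last 4 PCs into the current PC to index the
// Berti tables. Assumes last_pcs[0] is the most recent PC; the per-entry
// shift of 2 bits is an illustrative choice.
uint64_t history_hash(uint64_t pc, const uint64_t last_pcs[4]) {
  uint64_t h = pc;
  for (int i = 0; i < 4; ++i)
    h ^= last_pcs[i] << (2 * (i + 1));  // older PCs are shifted further left
  return h;  // the caller masks this down to the table's index width
}
```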

Why It Works

Adding a Bit-Map Filter eliminates redundant and useless prefetches. Using PC history adds program-flow context while learning memory accesses with minimal overhead. Disabling Pythia and next-line prefetching when they do not generate enough useful prefetches solves the problem of cache pollution due to wasteful prefetching. This is especially useful in bandwidth-constrained and multicore scenarios, where cache capacity and memory bandwidth must be shared judiciously for optimal performance.

Entangling Data Prefetcher (Agustín Navarro Torres, University of Zaragoza; Simranjit Singh, University of Murcia; Biswabandan Panda, IIT Bombay; Alberto Ros, University of Murcia)

Motivation

Comparing Berti with other state-of-the-art prefetchers, the authors identify a SPEC2017 workload where Berti achieves negligible performance gain over the no-prefetch baseline. Profiling this trace reveals that it consists of long-reuse strides (strided accesses separated by a 2K-cycle interval) and zero-strides (consecutive accesses to the same cache line). Berti cannot issue zero-delta prefetches, and even though prefetches are correctly issued for long-reuse deltas, they get evicted before the cache line is accessed. T-SKID, a Time-Skipping Prefetcher, is built on top of a standard PC-stride prefetcher but decouples the PC that triggers a prefetch (TriggerPC) from the PC that trains the predictor (TargetPC). This allows it to prefetch long-reuse and zero-stride patterns. However, the underlying stride prefetcher limits its scope to constant strides rather than the complex delta patterns that Berti predicts easily.

Idea

EDP is proposed as a virtual-address-based L1D prefetcher. It is trained and triggered on cache misses and prefetch hits (cache hits on prefetched lines). For every TargetPC, it records the fill latency of the demand access or prefetch request. It then searches the global PC history for the most recent PC observed more than fill-latency cycles ago (i.e., at or before current cycle – fill latency): this is the TriggerPC that could have triggered a timely prefetch for the TargetPC. This ‘Entangling Pair’ of PCs is added to the Entangling Table, which stores the set of TargetPCs for a given TriggerPC. EDP also examines the address history of each TargetPC to calculate the list of timely deltas (similar to Berti) and stores them, along with the current address, in a Delta Table indexed by TargetPC. To issue prefetches, the TriggerPC is used to obtain one or more TargetPCs, which in turn provide the addresses and deltas for timely prefetches. The prefetches calculated this way are passed through a Bloom Filter to drop redundant requests, then placed in a Proxy Prefetch Queue (PPQ), where each request waits until slots open up in the demand read queue; if no space opens up, the prefetch is not issued. Pythia is implemented at L2, with a throttling mechanism at the LLC that tracks each core’s requests and sets the EDP aggressiveness. A sketch of the TriggerPC search follows.
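
To make the entangling step concrete, here is a minimal C++ sketch of that search over the global PC history. The history depth (64 entries) and the deque representation are illustrative assumptions, not the authors’ exact design.

```cpp
#include <cstdint>
#include <deque>

struct HistEntry { uint64_t pc; uint64_t cycle; };

// Global PC history used to find an entangling TriggerPC: the most recent
// PC that executed early enough (at least fill_latency cycles ago) to have
// issued a timely prefetch for the current TargetPC.
struct GlobalPCHistory {
  std::deque<HistEntry> hist;  // most recent entry at the back

  void record(uint64_t pc, uint64_t cycle) {
    hist.push_back({pc, cycle});
    if (hist.size() > 64) hist.pop_front();  // bounded history depth
  }

  // Returns the TriggerPC for this fill, or 0 if no PC is old enough.
  uint64_t find_trigger(uint64_t cur_cycle, uint64_t fill_latency) const {
    if (cur_cycle < fill_latency) return 0;
    for (auto it = hist.rbegin(); it != hist.rend(); ++it)
      if (it->cycle <= cur_cycle - fill_latency)
        return it->pc;  // newest PC that leaves enough time to hide the fill
    return 0;
  }
};
```

The returned TriggerPC and the current TargetPC form the Entangling Pair recorded in the Entangling Table.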

Why It Works

Using a different PC to trigger prefetches allows EDP to successfully prefetch zero-stride and long-reuse delta patterns for its TargetPCs. Filtering out redundant prefetches reduces contention for resources, and the dedicated PPQ prevents prefetches from competing with critical loads. The LLC throttling mechanism helps distribute resources evenly in the multi-core scenario.

Composite Prefetching with Bandits (Charles Block, Pedro Palacios, Abraham Farrell, Gerasimos Gerogiannis, Josep Torrellas, University of Illinois at Urbana-Champaign)

Motivation

The authors point out that the current state-of-the-art prefetchers try to optimize low-level metrics such as accuracy, timeliness and coverage. The system performance (IPC) depends on these factors, but can have variable sensitivity to each of them depending on the workload and program phase. Furthermore, a single prefetcher is generally insufficient to deliver the best performance for a diverse set of workloads – industrial processors generally deploy a composite prefetcher consisting of multiple prefetch engines.

Idea

A Multi-Armed Bandit is a Reinforcement Learning agent that repeatedly chooses the best action (arm) to maximize a reward function. Inspired by this, a Micro-Armed Bandit (MAB) is used to prefetch at the L2 cache. Each ‘arm’ is a different configuration of 5 state-of-the-art prefetchers:

  • Next Line, Spatial Memory Streaming, Best Offset Prefetcher: Can be turned ON or OFF
  • Stride, Stream prefetchers: Degree can be tuned to control aggressiveness

A Bloom filter is implemented to prevent issuing redundant prefetches. Each arm is used for a fixed time period (a bandit step), after which the agent evaluates the reward it generated and compares it against previously observed rewards to decide which arm to use next. The total IPC of the core is used as the reward function for the MAB.
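
As an illustration of the agent’s core loop, here is a minimal epsilon-greedy sketch. The epsilon-greedy policy, exploration rate, and running-mean reward update are assumptions for illustration; the submission’s actual bandit policy may differ.

```cpp
#include <cstdlib>
#include <vector>

// Micro-Armed Bandit over prefetcher configurations ("arms"). Epsilon-greedy
// selection with a running-mean reward is an illustrative policy choice.
struct MicroArmedBandit {
  std::vector<double> mean_reward;  // running mean IPC per arm
  std::vector<int>    pulls;
  double epsilon = 0.1;             // exploration rate (illustrative)

  explicit MicroArmedBandit(int num_arms)
      : mean_reward(num_arms, 0.0), pulls(num_arms, 0) {}

  // Pick the arm to use for the next bandit step.
  int select_arm() {
    if ((double)std::rand() / RAND_MAX < epsilon)
      return (int)(std::rand() % mean_reward.size());  // explore a random arm
    int best = 0;                                      // exploit best-known arm
    for (size_t a = 1; a < mean_reward.size(); ++a)
      if (mean_reward[a] > mean_reward[best]) best = (int)a;
    return best;
  }

  // Feed back the IPC measured while 'arm' was active.
  void update(int arm, double ipc) {
    ++pulls[arm];
    mean_reward[arm] += (ipc - mean_reward[arm]) / pulls[arm];
  }
};
```

At each bandit step, select_arm() picks the prefetcher configuration to run next, and update() feeds back the IPC measured during the step that just ended.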

To optimize multi-core performance, another agent called ‘µMama’ is added at the system level, using the geometric mean of IPCs across all cores as a reward function. At each timestep, it decides whether to allow the cores to pursue their independent actions, or to force them into joint actions which have a record of increasing the µMama reward.
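
For reference, a minimal sketch of µMama’s system-level reward, using the log-space form of the geometric mean for numerical stability:

```cpp
#include <cmath>
#include <vector>

// Geometric mean of per-core IPCs (assumes at least one core).
double geomean_ipc(const std::vector<double>& core_ipcs) {
  double log_sum = 0.0;
  for (double ipc : core_ipcs) log_sum += std::log(ipc);
  return std::exp(log_sum / core_ipcs.size());
}
```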

Why It Works

Using Reinforcement Learning to directly maximize the system performance ensures that the prefetcher dynamically re-configures itself with execution to improve IPC. The caveat is that this now becomes a search space problem – the arms of the bandit need to be diverse enough to support different kinds of workloads, in order to deliver the best performance.

Global Berti (Gilead Posluns and Mark Jeffrey, University of Toronto)

Motivation

Berti is a state-of-the-art prefetcher that detects streaming patterns, i.e., consistent delta values between accesses by the same PC. Practical workloads, however, often exhibit spatial patterns, identified by consistent delta values between accesses by different PCs. In the absence of streaming patterns, prefetching based on spatial patterns could compensate for Berti’s reduced efficacy.

Idea

Global Berti detects spatial patterns using Berti’s existing structures. Each row of the History Table conventionally stores, in FIFO order, the addresses of all the lines accessed by a particular PC. When a streaming pattern cannot be detected, local training is useless, so Global Berti instead looks at the most recent address of every PC to detect spatial patterns (global training). Berti’s Delta Table holds the local (same-PC) delta values; Global Berti stores the global deltas (across PCs) in the same table, adding a local bit to differentiate streaming from spatial training, as sketched below.
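
Below is a minimal sketch of the two training modes. Berti’s latency-based timeliness filtering is omitted, the streaming_detected flag stands in for Berti’s existing pattern detection, and the table sizes are illustrative assumptions.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct DeltaEntry { int64_t delta; bool local; };  // local=true: streaming

// Two training modes sharing Berti's structures: per-PC history rows and a
// delta table with an added 'local' bit.
struct GlobalBerti {
  std::unordered_map<uint64_t, std::vector<uint64_t>>   history;  // PC -> line addrs (FIFO)
  std::unordered_map<uint64_t, std::vector<DeltaEntry>> deltas;   // PC -> trained deltas

  void train(uint64_t pc, uint64_t line_addr, bool streaming_detected) {
    auto &row = history[pc];
    if (streaming_detected && !row.empty()) {
      // Local (streaming) training: delta against this PC's own history.
      deltas[pc].push_back({(int64_t)(line_addr - row.back()), true});
    } else {
      // Global (spatial) training: deltas against the most recent address
      // of every other PC.
      for (auto &[other_pc, other_row] : history)
        if (other_pc != pc && !other_row.empty())
          deltas[pc].push_back({(int64_t)(line_addr - other_row.back()), false});
    }
    row.push_back(line_addr);
    if (row.size() > 16) row.erase(row.begin());  // bounded FIFO row
  }
};
```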

Why It Works

By itself, Berti is quite effective at detecting and covering streaming patterns. Adding the capability to detect spatial patterns when streaming patterns are absent increases Global Berti’s coverage and therefore its overall performance. As expected, the highest speedup over Berti is obtained in SPEC2017 and graph workloads dominated by irregular accesses that require spatial prefetching. AI workloads, which contain mostly streaming patterns, see a much smaller speedup.

General Trends

Although the major focus of almost all DPC-4 submissions is to overcome the limitations of the high-performing Berti/Pythia baseline, they highlight several key trends in data prefetching research:

  • Prefetching across Physical Page Boundaries: Issuing page-crossing prefetches is extremely useful for AI workloads since they are dominated by streaming accesses. This is leveraged by most submissions to gain an edge over the baseline prefetcher configuration.
  • Preventing Redundant Prefetches: Quite a few papers also combat excessive prefetching and resource contention through advanced throttling, priority, and filtering mechanisms.
  • Increased System-Level and Multi-Core Awareness: There is a growing emphasis on system-aware solutions to judiciously manage shared resources like memory bandwidth, which is constrained in high-core-count datacenters. This includes core-level fairness throttling (Emender, EDP) and global coordination agents (µMama) to dynamically adjust prefetcher configurations for optimal multi-core performance.
  • Expanding Pattern Coverage for Diverse Workloads: Submissions seek to improve coverage beyond simple streaming patterns. This includes detecting spatial patterns across different PCs (Global Berti), and targeting complex patterns like long-reuse and zero-strides (EDP). The adoption of PC history (BertiGO) also provides better context for pattern recognition.
  • Shift Towards Adaptive and Composite Designs: Recognizing that a single prefetcher is insufficient for diverse workloads, the trend moves toward composite prefetchers. This is accompanied by dynamic re-configuration to select the best prefetcher setting at runtime, and adaptive heuristics to tune aggressiveness.

About the Author

Digvijay Singh obtained his Bachelor’s degree from BITS Pilani and his Master’s degree from Texas A&M University where he worked on data prefetching as part of the CAMSIN research group. He currently works as a Silicon Architect in Google’s mobile CPU team.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.