This article is the first in a two-part series that summarizes the key contributions of the 4th Data Prefetching Championship (DPC-4), held in conjunction with the 32nd iteration of HPCA in 2026. While discussing the innovative data prefetching techniques presented in this contest, we focus on the functionality of the proposed algorithms and also explain why they are effective. Finer implementation details can be found in the papers or the source code.
Implementation Constraints
All prefetchers are evaluated against a baseline configuration that employs the Berti prefetcher (DPC-3 winner) at L1D (Level-1 Data cache) and the Pythia prefetcher at L2 (Level-2 cache). While there were no constraints on design complexity, upper limits were placed on the storage budget of each prefetcher to ensure the designs were practically feasible to implement: 32KB for the L1D prefetcher, 128KB for the L2 prefetcher, and 256KB for the LLC (Last-Level Cache) prefetcher.
Keynotes
The event included two keynote talks. The first keynote, titled “Is Prefetcher Research Still Alive?”, was given by Leeor Peled from Huawei. Leeor discussed the modern relevance of prefetching research, offering a pragmatic philosophy for academic researchers. He argued that the primary objective should not necessarily be to surpass “best-in-class” models – which are often the result of years of ‘engineered’ fine-tuning – but rather to introduce novel, high-potential concepts that invite further optimization. He emphasized that while an individual effort might not immediately surpass the state-of-the-art, a sufficiently “interesting” technique can evolve into a transformative solution through subsequent community-driven iteration.
He suggested two optimizations that can be explored:
- Building a Semantic Prefetcher that correlates memory accesses with the address-generating code, i.e., a high-precision version of the Runahead Prefetcher that selectively runs only the code responsible for generating a future address.
- Training neural networks to identify deep correlations between memory accesses, potentially unlocking the ability to predict complex, non-linear patterns that remain invisible to current heuristic-based logic.
The following issues can (and should) be addressed to build better prefetchers:
- Generalizing complex patterns, e.g., pointer-chasing loads
- Accurately choosing memory accesses with high correlation for better training
- Prefetching to the appropriate cache level to optimize for timeliness
- Throttling prefetches for fairness amongst multiple cores
- Using LLMs to process memory traces instead of text sequences
The second keynote, titled “Data Prefetching: A Datacenter Perspective”, was given by Akanksha J. from Google. Addressing the memory bottleneck problem in modern datacenters (40% of CPU cycles are spent idling for memory responses), Akanksha highlighted that cloud environments are characterized by massive multi-threading and incessant context switching. In these scenarios, a single thread may migrate across multiple cores, while each core rotates through a plethora of applications. The Google workloads used in DPC-4 better represent this reality and are primarily frontend-bound. Without a sophisticated instruction prefetcher to streamline code delivery, the underlying bottlenecks in data prefetching remain obscured and impossible to solve. She also analyzed the structural failures of current prefetching solutions, identifying these primary aspects:
- Current design philosophy focuses on “tuning for the common case,” resulting in hard-coded heuristic values—such as fixed confidence thresholds and prefetch degrees—that are taped out into non-programmable silicon. While these “black boxes” are meticulously engineered to squeeze every drop of performance from SPEC workloads, they lack the flexibility required for the high heterogeneity of datacenter tasks. Consequently, these resource-hungry techniques often penalize cloud performance rather than enhancing it.
- If we disable hardware prefetchers entirely and rely on software to insert prefetches, we miss out on critical opportunities to utilize valuable information about system states (coherence, timeliness, cache hits/misses) that improves prefetching. Akanksha proposed a shift towards “Software-Defined Prefetching,” a paradigm that transcends current ISA limitations. In this model, the software layer dynamically selects which code segments to target and determines the optimal hardware prefetcher to activate for peak accuracy. Simultaneously, the hardware leverages real-time system state data to maximize coverage.
Furthermore, Akanksha advocated for evaluating all prefetching techniques within constrained-bandwidth environments, arguing that such stress tests better reflect the realities of modern compute environments.
Now, on to prefetcher designs themselves.
Virtual Inter-Page Prefetcher (Ho Je Lee, Won Woo Ro; Yonsei University)
Motivation
- Analyzing the baseline prefetcher configuration, the authors observed that the L2 Prefetcher (Pythia) is more effective than the L1 Prefetcher (Berti) in reducing Misses Per Kilo Instructions (MPKI) for the Last Level Cache (LLC).
- Since Pythia operates in the Physical Address (PA) space, it cannot be allowed to issue prefetches across page boundaries: adjacent physical pages need not belong to the same process, so an incorrect physical page access poses a security risk.
- A roofline study shows that there is significant performance to be gained when Pythia is allowed to issue page-crossing prefetches in the PA space. This advantage is amplified when it is granted visibility into the Virtual Address (VA) space, which prevents incorrect page accesses.
Idea
VIP is implemented at the L1 level but issues prefetches to the L2. It is trained on L1 misses by reading the {PC, VA} information off the packets sent to the L1 MSHRs. These are written to the VIP Stride Table, which calculates and stores the observed stride for each PC. If a stride value repeats, the confidence is incremented; otherwise, it is reset. The confidence value determines the prefetch degree.
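A minimal C++ sketch of this training flow is shown below; the table organization, counter widths, and saturation limit are our assumptions for illustration, not details taken from the paper.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch of VIP-style stride training; structure layout and
// limits are assumptions, not the authors' implementation.
struct VIPStrideEntry {
    uint64_t last_va    = 0;  // last virtual address seen for this PC
    int64_t  stride     = 0;  // last observed stride, in cache lines
    uint8_t  confidence = 0;  // saturating counter; doubles as prefetch degree
};

class VIPStrideTable {
    std::unordered_map<uint64_t, VIPStrideEntry> table_;  // indexed by PC

public:
    // Train on an L1-miss packet {PC, VA}; returns the prefetch degree
    // (0 means "do not prefetch yet").
    int train(uint64_t pc, uint64_t va) {
        VIPStrideEntry& e = table_[pc];
        int64_t delta =
            (static_cast<int64_t>(va) - static_cast<int64_t>(e.last_va)) >> 6;
        if (e.last_va != 0 && delta != 0 && delta == e.stride) {
            if (e.confidence < 4) ++e.confidence;  // stride repeated: grow confidence
        } else {
            e.stride = delta;  // stride changed: restart training
            e.confidence = 0;
        }
        e.last_va = va;
        return e.confidence;
    }
};
// Prefetches go to the L2 at va + i*stride for i = 1..degree; since the
// arithmetic is on virtual addresses, they may freely cross 4KB pages.
```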
Why It Works
The implemented VIP configuration is a simple yet elegant solution to gain performance over the baseline by supplementing the existing Berti and Pythia prefetchers with cross-page prefetches (note that the DPC-3 version of Berti operates in the PA space and cannot issue prefetches across page boundaries). As expected, the stride prefetcher boosts AI workloads with sequential accesses to large data structures that span pages. Typical CPU workloads such as SPEC see a moderate gain; the control-flow-dominated Google workloads show a marginal slowdown since they rarely exhibit uninterrupted streams.
Signature Pattern Prediction and Access-Map Prefetcher (Maccoy Merrell, Lei Wang, Paul Gratz, Stavros Kalafatis; Texas A&M University)
Motivation
Access Map Pattern Matching (AMPM) and Signature Path Prefetching (SPP) are both considered state-of-the-art prefetching techniques; while SPP is sensitive to the order of memory accesses, AMPM is resistant to out-of-order (OoO) execution. However, AMPM relies heavily on stored patterns for each region and is unable to issue prefetches for new regions or when the observed accesses deviate from expectations. SPP excels in exactly these situations and can even make predictions based on its own issued prefetches.
Idea
Implemented at the L2 level, a Region Table (RT) tracks all access maps (as bit-vectors) on a per-region basis. Upon a memory access, an N-bit portion of the respective access map is used to index a Pattern Table (PT). The PT outputs the most frequently occurring N-bit pattern as a prefetch candidate, which can in turn be used to index the PT speculatively. Similar to SPP, speculative prefetching continues until the overall confidence drops below a threshold. The RT access map indicates the recently accessed cache lines and filters out redundant prefetches.
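The sketch below illustrates the RT/PT interplay in C++; the region size, slice width, confidence arithmetic, and stopping threshold are illustrative assumptions rather than the paper's parameters.

```cpp
#include <bitset>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of an SPPAM-style lookup; all values are illustrative.
constexpr int kRegionLines = 64;  // 4KB region of 64-byte lines
constexpr int kN = 8;             // width of the access-map slice used as PT index

class SPPAM {
    // Region Table: one access map (bit per line) per 4KB region.
    std::unordered_map<uint64_t, std::bitset<kRegionLines>> region_table_;
    // Pattern Table: N-bit slice -> (most frequent next slice, its confidence).
    // (PT training, i.e. updating the most frequent successor, is omitted here.)
    std::unordered_map<uint32_t, std::pair<uint32_t, float>> pattern_table_;

    // Extract the kN-bit slice of the access map ending at `line`.
    static uint32_t slice(const std::bitset<kRegionLines>& map, int line) {
        uint32_t s = 0;
        for (int i = 0; i < kN; ++i) {
            int idx = line - i;
            if (idx >= 0 && map.test(idx)) s |= 1u << i;
        }
        return s;
    }

public:
    void on_access(uint64_t va) {
        uint64_t region = va >> 12;
        int      line   = static_cast<int>((va >> 6) & (kRegionLines - 1));
        auto&    map    = region_table_[region];
        map.set(line);

        // Chase predictions speculatively, SPP-style, multiplying confidence
        // along the path and stopping once it falls below a threshold.
        uint32_t sig = slice(map, line);
        float path_conf = 1.0f;
        while (true) {
            auto it = pattern_table_.find(sig);
            if (it == pattern_table_.end()) break;
            auto [next, conf] = it->second;
            path_conf *= conf;
            if (path_conf < 0.25f) break;
            // Each set bit in `next` is a prefetch candidate; the RT access
            // map filters lines already touched (redundant prefetches).
            for (int i = 0; i < kN; ++i) {
                int idx = line + 1 + i;
                if ((next >> i & 1u) && idx < kRegionLines && !map.test(idx))
                    std::printf("prefetch region %#llx line %d\n",
                                (unsigned long long)region, idx);
            }
            sig = next;  // feed the prediction back as the next PT index
        }
    }
};
```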
Why It Works
The authors have identified the complementary nature of SPP and AMPM, and have combined them effectively to pair the OoO resistance of AMPM with the speculative mechanism of SPP. Additionally, numerous throttling mechanisms are implemented, which consider per-pattern usefulness as well as global metrics such as DRAM bandwidth and overall prefetch usefulness to drop prefetches and set the prefetch degree. SPPAM is implemented at the L2C with Berti (the MICRO version, which operates in the VA space) at L1D and Bingo at the LLC. Similar to the previous paper, cross-page stream information is passed to SPPAM from the L1D.
Emender (Jiajie Chen, Tingji Zhang, Xiaoyi Liu, Xuefeng Zhang, Peng Qu, Youhui Zhang; Tsinghua University)
Motivation
An evaluation of different combinations of state-of-the-art prefetchers shows that VBerti (L1D) plus Pythia (L2) is the highest-performing combination. Here, VBerti refers to the MICRO version of Berti that operates in the VA space, allowing it to issue page-crossing prefetches. The authors observe that this optimal prefetcher combination issues so many prefetch requests that the prefetch queue fills up quickly, leading to useful prefetches being dropped. A second-order effect of a full prefetch queue is excessive usage of L1D-to-memory bandwidth, which can delay critical loads.
Idea
Four key features are added to tackle the problem of over-prefetching in the VBerti+Pythia configuration (the last two are sketched in code after this list):
- A Pending Target Buffer is added to sort all issued prefetches by confidence, which helps prioritize useful prefetches across different PCs.
- A Cuckoo Filter is added that tracks the VAs already present in the cache to prevent redundant prefetches. This structure is chosen for its O(1) query time, high accuracy, and zero false negatives.
- A Dynamic Confidence Threshold is added that rises with the cache miss rate, throttling low-confidence prefetches.
- A Fairness-based Throttling scheme is implemented across cores, which tracks useless prefetches per core at the L3 and stops the core with the most useless prefetches from prefetching.
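To make the throttling logic concrete, here is a minimal C++ sketch of the last two features; the scaling constants, threshold formula, and counter organization are our assumptions, not Emender's exact design.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of Emender-style throttling; constants are illustrative.

// Dynamic Confidence Threshold: the minimum confidence a prefetch must carry
// scales with the observed cache miss rate, so a stressed cache admits only
// high-confidence prefetches.
double dynamic_threshold(double miss_rate /* 0.0 .. 1.0 */) {
    constexpr double kBase = 0.30, kSlope = 0.50;  // assumed constants
    return std::min(1.0, kBase + kSlope * miss_rate);
}

// Fairness-based Throttling: track useless prefetches per core at the L3
// and silence the worst offender for the next interval.
template <std::size_t NumCores>
class FairnessThrottle {
    std::array<uint64_t, NumCores> useless_{};  // prefetched-but-never-used counts

public:
    void record_useless(std::size_t core) { ++useless_[core]; }

    // Called at the end of each interval: the core with the most useless
    // prefetches is barred from prefetching until the counters are reset.
    std::size_t core_to_silence() const {
        return static_cast<std::size_t>(
            std::max_element(useless_.begin(), useless_.end()) - useless_.begin());
    }

    void reset() { useless_.fill(0); }
};
```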
Why It Works
The authors identify problematic areas in the baseline Berti+Pythia system and propose features that effectively address them. The largest performance improvement comes from the Cuckoo Filter in the single-core configuration and from Fairness Throttling in the multi-core configuration. Since Emender provides the smallest gain in the limited-bandwidth configuration, it would be interesting to examine its accuracy data.
sBerti (Jiapeng Zhou, Ben Chen, Kunlin Li, Yun Chen; HKUST, Guangzhou)
Motivation
When profiling the DPC-4 workloads on the given baseline prefetcher configuration (Berti + Pythia), the authors observed a high L1D miss rate in the AI-ML and Google workloads. A deeper analysis of the traces indicated that most of these misses occurred when the access stream crossed a 4KB physical page boundary, which happens frequently in these workloads. The version of Berti used in the baseline does not issue prefetches across page boundaries; thus, a stride prefetcher can help.
Idea
A decoupled Smart Stride Prefetcher is added at the L1D, which operates in the VA space and can therefore track memory access streams across page boundaries. It is trained using a Smart Stride Table (SST), which is indexed by a hash of the PC and subtracts the last VA from the current VA to calculate a delta. If the absolute value of the delta is a multiple of the stored stride, the confidence is updated; this also provides resistance to out-of-order execution. Prefetches are issued if this confidence exceeds a static threshold. The lookahead is tuned via a heuristic: it is incremented upon observing late prefetches and decremented upon timely ones. A Recent Prefetch Table stores recently issued prefetches to track their timeliness and to filter duplicate prefetches between the Berti and Smart Stride engines.
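A minimal C++ sketch of the SST update follows; the hash, table sizing, counter widths, and thresholds are assumptions for illustration, not the authors' exact implementation.

```cpp
#include <cstdint>
#include <cstdlib>
#include <unordered_map>

// Hypothetical sketch of a Smart Stride Table update.
struct SSTEntry {
    uint64_t last_va    = 0;
    int64_t  stride     = 0;  // learned stride, in cache lines
    uint8_t  confidence = 0;
    uint8_t  lookahead  = 1;  // raised on late prefetches, lowered on timely ones
};

class SmartStridePrefetcher {
    std::unordered_map<uint32_t, SSTEntry> sst_;  // indexed by a hash of the PC
    static constexpr uint8_t kThreshold = 2;      // assumed static threshold

    static uint32_t hash(uint64_t pc) { return static_cast<uint32_t>(pc ^ (pc >> 16)); }

public:
    // Returns a VA to prefetch, or 0 if confidence is still too low.
    uint64_t train(uint64_t pc, uint64_t va) {
        SSTEntry& e = sst_[hash(pc)];
        int64_t delta =
            (static_cast<int64_t>(va) - static_cast<int64_t>(e.last_va)) >> 6;
        e.last_va = va;
        if (e.stride != 0 && delta != 0 &&
            std::llabs(delta) % std::llabs(e.stride) == 0) {
            // A multiple of the stride still counts as a confirmation: OoO
            // execution may reorder accesses within the stream.
            if (e.confidence < 7) ++e.confidence;
        } else {
            e.stride = delta;  // new behavior: restart training
            e.confidence = 0;
        }
        if (e.confidence < kThreshold || e.stride == 0) return 0;
        // VA-space arithmetic: the target may cross a 4KB page boundary.
        return va + static_cast<uint64_t>(e.stride * 64 * e.lookahead);
    }

    void on_late_prefetch(uint64_t pc)   { SSTEntry& e = sst_[hash(pc)]; if (e.lookahead < 8) ++e.lookahead; }
    void on_timely_prefetch(uint64_t pc) { SSTEntry& e = sst_[hash(pc)]; if (e.lookahead > 1) --e.lookahead; }
};
```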
Why It Works
The addition of a decoupled stride prefetcher gives sBerti the ability to issue prefetches across physical page boundaries, reducing the “Cold-start Penalty” of Berti. The heuristic-based dynamic distance adjustment helps tune the aggressiveness at runtime, allowing a longer lookahead for AI-ML workloads dominated by streaming accesses. The final sBerti configuration (Stride + Berti at L1D, Pythia at L2) delivers the best performance in the full-bandwidth scenario, where the stride engine can prefetch further ahead.
We will cover the rest of the prefetchers in part 2 of this post.
About the Author
Digvijay Singh received his Bachelor’s degree from BITS Pilani and his Master’s degree from Texas A&M University where he worked on data prefetching as part of the CAMSIN research group. He currently works as a Silicon Architect in Google’s mobile CPU team.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.