I’m not sure if I should be writing a blog for architects. As some of you know, my expertise is in database systems. In response to this blog, I’m likely to get flames from some of you informing me about how I have missed the boat in my observations. If that happens, then all is not lost, as we may be able to bring the architecture and database communities closer with this discussion. This much I can say confidently: some of us in the database community are not sure what parallelism model future commodity processors will present and, as a corollary, what we should do from both an algorithmic and a software engineering perspective so that data processing software can use future hardware efficiently.
Over the next few blogs, I’ll go into a number of issues expanding on this point. In this blog, I argue that commodity mainstream processors are overly complicated, that we should simplify them to expose a simple shared-nothing view of parallelism, and that any complexity we do need should be built into the software layer. Thus, I’m advocating for KISS as the default principle for designing commodity processors.
First, a refresher on parallelism models for database systems
Back in the 1980s, when database systems needed to scale out to work on multi-processor systems, there was a vibrant debate in the community about different models of parallelism. Two of the popular models were shared-nothing and shared-memory. In shared-nothing, the hardware model is that of independent processors, and there is no sharing of any storage resources. Thus, the (database) software explicitly deals with data partitioning and synchronization by moving data across a communication medium. In this case, the database software is very aware of data movement costs, and hence has explicit methods to optimize for them. In the shared-memory model, there is global memory addressable across all the processors, and common data structures (e.g., the buffer pool and lock tables) can be put there. The shared-memory model made it easier to port existing database software to the parallel processing world, but quickly ran into performance issues. Why? Systems that used the shared-memory model often found that to get high performance, they had to worry about issues such as data partitioning and data movement too! The bottom line is that shared-nothing, while more complicated, is a better way to build high-performance database systems. It exposes a simple way to reason about the hardware, and the actual costs associated with parallelism are dealt with directly in the software.
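To make the shared-nothing idea concrete, here is a minimal sketch in Python of a partitioned group-by sum. It is a toy illustration, not how any particular database system implements it: the `Node` class, the modulo placement rule in `partition_of`, and the in-process "network" (a plain list used as an inbox) are all assumptions made for this example. The point it tries to show is that every row crossing a node boundary is explicit in the software and can be counted.

```python
from collections import Counter


def partition_of(key, num_nodes):
    """Owner node for a key (hypothetical placement rule: modulo on an int key)."""
    return key % num_nodes


class Node:
    """One shared-nothing node: private rows, no access to another node's data."""

    def __init__(self, node_id, rows):
        self.node_id = node_id
        self.rows = rows   # locally stored (key, value) rows
        self.inbox = []    # rows received over the simulated network


def shuffle(nodes):
    """Repartition rows by key; return how many rows crossed a node boundary."""
    for node in nodes:
        node.inbox = []
    moved = 0
    for src in nodes:
        for key, value in src.rows:
            dst = nodes[partition_of(key, len(nodes))]
            dst.inbox.append((key, value))
            if dst is not src:
                moved += 1  # explicit, countable data movement
    return moved


def local_sum(node):
    """Aggregate only the rows this node owns after the shuffle."""
    totals = Counter()
    for key, value in node.inbox:
        totals[key] += value
    return dict(totals)


def parallel_group_sum(nodes):
    """Group-by sum over all nodes; returns (result, rows_moved)."""
    moved = shuffle(nodes)
    result = {}
    for node in nodes:
        # Key sets are disjoint across nodes, so merging needs no coordination.
        result.update(local_sum(node))
    return result, moved
```

With two nodes holding `[(0, 1), (1, 2), (2, 3)]` and `[(0, 10), (1, 20), (3, 30)]`, the shuffle moves two rows and each node then aggregates its own keys without touching the other's state. A real system would choose the placement rule to minimize that moved count, which is exactly the kind of optimization the shared-nothing view puts in the software's hands.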
The slide into forced shared-memory
Since the start of the multi-core era, we have slipped into a hardware model that defaults to what I call a “forced shared-memory” model. Today, if you are on a multi-processor machine and care about performance, you have to be careful about where memory is allocated and about binding computation to the location of the associated data (the “gift” of NUMA). The processor is going to do things like run cache coherence protocols and share caches across cores on the same processor (e.g., at the L3 level), all of which gets in the way of having a transparent performance model that the software can reason about. For example, at one instant, access to your data may be fast, but the very next access could be slow because some coherence protocol effect invalidated your copy (perhaps triggered by a write to unrelated data that merely happens to sit on the same cache line as yours). Or, some data in L3 just got bumped out because of some other computation on some other processing unit. (All these problems get worse when the infrastructure is shared across applications.)
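The same-cache-line effect mentioned above (often called false sharing) can be illustrated with a toy model. The Python sketch below is a deliberately simplified simulation, not a model of any real processor: it assumes a write-invalidate protocol, caches of unlimited capacity, a 64-byte line, and a trace of writes only. The function names `count_misses` and `alternating_writes` are made up for this example.

```python
LINE_SIZE = 64  # bytes per cache line (a common size; assumed here)


def count_misses(accesses, num_cores=2, line_size=LINE_SIZE):
    """accesses: list of (core, byte_address) writes. Returns misses per core."""
    cached = [set() for _ in range(num_cores)]  # lines each core holds validly
    misses = [0] * num_cores
    for core, addr in accesses:
        line = addr // line_size
        if line not in cached[core]:
            misses[core] += 1
            cached[core].add(line)
        # Write-invalidate: every other core loses its copy of this line.
        for other in range(num_cores):
            if other != core:
                cached[other].discard(line)
    return misses


def alternating_writes(addr0, addr1, rounds):
    """Core 0 writes addr0, then core 1 writes addr1, repeated `rounds` times."""
    trace = []
    for _ in range(rounds):
        trace.append((0, addr0))
        trace.append((1, addr1))
    return trace


# Two cores updating *different* variables that share one line: every write misses.
same_line = count_misses(alternating_writes(0, 8, rounds=100))
# The same two variables placed on separate lines: one cold miss each.
separate_lines = count_misses(alternating_writes(0, 64, rounds=100))
```

In this model, `same_line` is `[100, 100]` while `separate_lines` is `[1, 1]`, even though the two cores never touch each other's variable. The software did nothing different at the logical level; only the data layout changed, which is why such effects are hard to reason about from the application's side.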
Now some of you might say to me that if I don’t like the default forced shared-memory model, then I can engineer around it by using mechanisms like NUMA-aware memory management, (L3) cache partitioning/allocation primitives, etc. Sure, we can work around some issues, but other features (e.g., cache coherence) are always on. More importantly, the complexity is in the wrong place, especially as we look ahead to the future.
Time to simplify?
Thus, it is likely time to simplify the parallel processor performance model by simplifying (commodity) processor design itself. Why not go back to having simpler processing units that can be bundled easily into shared-nothing clusters? So, go ahead and take away things like the coherence protocols. Make these processing widgets have a transparent and more predictable performance model. Let applications worry about how to achieve efficient sharing. If we need to provide a coherent memory model, then let’s build that logic/magic into a systems layer above (we need to do that anyway for clusters). This approach is also going to be easier to debug and improve, as you have to change a software implementation rather than a hardware implementation! One might even be able to create a compiler/translator so that current applications can work with little or no modification (although perhaps slowly) on such hardware. Applications that care about performance will likely have to do what database software did decades ago: work with a shared-nothing hardware view, and build in methods to explicitly reason about and optimize data movement. Astute readers may have noted that this way of thinking is inevitable for applications that need to scale out beyond the limited parallelism that a single box can offer, so that is another key advantage.
Some of you might say that specialization of type X (your favorite type goes here) is essentially doing some or all of this. More power to you, I say. But how about complementing such work with efforts to simplify the default mainstream commodity processors that largely run the world today? The approach described above may also free up resources on the processor die that architects can put to other creative uses.
Jignesh Patel is a Professor of Computer Sciences at the University of Wisconsin and works on databases.