Mainstream non-volatile main memory (NVMM) is just around the corner: Intel is opening up access to its 3D XPoint technology to a broader range of companies, and we are gradually learning more about the technology. Despite our growing understanding, the question of how applications can most efficiently (in terms of both programmer effort and performance) use NVMM remains wide open.
In my first Computer Architecture Today blog post, I described one possible path forward for the software that runs on NVMM. I envisioned three steps in the evolution of NVMM programming:
- NVMM 1.0: Programmers would use NVMM as a large-capacity DRAM replacement or as a drop-in replacement for existing storage media. This would lead to respectable — but not enormous — speedups.
- NVMM 2.0: A cadre of expert programmers would tackle the difficult challenge of building persistent in-memory data structures and expose them to applications as libraries (e.g., a persistent standard template library).
- NVMM 3.0: More robust language and compiler support would make it possible for developers to quickly build similar data structures for their particular needs.
2.0 and 3.0 are exciting because they offer really large speedups (20x in some cases), but they are still far from being a reality: NVMM programming experts remain a rarity, and developing, testing, deploying, and trusting flexible, feature-rich libraries takes time. Good compiler and language support is probably a decade away.
This post revises this vision a bit based on my group’s recent experience tuning applications to use NVMM. Our results point toward “NVMM 1.5”, a model that is easily within reach and offers many of the benefits of full-fledged persistent data structures.
The Easy Way
The easiest way to use NVMM is to treat it like a storage device, run a file system on it, and use normal system calls to access it. You can use a conventional file system built for disks (hard or solid state), or you can use an NVMM-aware one: a specialized file system like NOVA, or ext4 or XFS in “direct access” (DAX) mode. Using an NVMM-aware file system speeds up applications by anywhere from a few percent to 5x.
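For reference, the conventional path might look like the sketch below: an append followed by fsync(), both of which are system calls. The function name append_durable() is made up for illustration and error handling is kept minimal. The same code runs unchanged on a disk file system or an NVMM-aware one, which is what makes this approach easy.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append len bytes to the file at path and wait
 * until they are durable. Returns 0 on success, -1 on error. */
int append_durable(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);  /* may write fewer bytes than asked */
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    int rc = fsync(fd);  /* second system call: force the data to storage */
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Every append here crosses into the kernel at least twice, which is the overhead the later sections try to avoid.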
The Fast Way
In my original blog post (and in papers I and others have written) I claimed that if you want to see the real benefits of NVMM, you need to throw away the conventional file system interface and embrace NVMM as the memory it is.
Using NVMM as memory means building pointer-based, persistent data structures that resemble the trees, tables, heaps, and lists we are familiar with. For example, consider RocksDB, a high-performance embedded key-value store based on log-structured merge trees (LSM-trees). LSM-trees perform many synchronous append operations as they merge the upper layers of the tree into lower layers. With NVMM we can optimize RocksDB by replacing the DRAM-based skip-list it uses with a persistent version. We tried this and the performance gains are large: between 2.6x and 22x, depending on the underlying file system. We see similar results with some other key-value stores.
But building persistent data structures like that skip-list is hard because they must simultaneously be fast, highly concurrent, and reliable in the face of power failures. How hard? Well, there has been at least one top-tier conference paper per year since 2015 proposing a new implementation of an NVMM B-Tree. At this rate, a performant, persistent version of the standard template library is going to take many years (and probably support a good number of Ph.D.s).
Not So Fast!…But Much Easier
There is, however, a middle ground between easy and fast that discards parts of the conventional file system interfaces (open(), close(), read(), write(), and mmap()) but preserves the basic approach to accessing storage that those interfaces provide.
Rather than adapting complex data structures to NVMM, we can emulate well-known file operations in user-space. Appending to a file is an especially promising target because many databases use it for logging. Appending to an NVMM-based file in user-space is easy:
- Open the file, preallocate some space with fallocate(), and map it with mmap().
- Instead of write(), use memcpy() to copy data into the file. Instead of fsync(), use the clwb (cache line write back) instruction to force the updates to persistent memory.
These changes do not require any fancy data structures, or changes to the underlying file format, or alterations to the system’s basic algorithms. For RocksDB, this change increases performance by between 2.2 and 19x — almost the same speedup as a custom persistent data structure, but for much less effort.
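The two steps above can be sketched in C roughly as follows. This is an illustration under stated assumptions, not the code we used for RocksDB: struct nv_log, nv_log_open(), nv_log_append(), and persist() are hypothetical names, the cache line is assumed to be 64 bytes, and posix_fallocate() stands in for fallocate() to keep the sketch portable. The clwb path is only compiled when the compiler targets a CPU with CLWB (for example with -mclwb), and it only provides durability when the file lives on a DAX file system (on Linux, ideally mapped with MAP_SYNC); otherwise the sketch falls back to msync().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#if defined(__CLWB__)
#include <immintrin.h>
#endif

#define CACHE_LINE 64u  /* assumed cache-line size */

/* Hypothetical append-only log backed by a preallocated, mapped file. */
struct nv_log {
    char  *base;      /* start of the mapping */
    size_t capacity;  /* bytes preallocated */
    size_t tail;      /* offset of the next append (kept in DRAM only) */
};

/* Push [addr, addr+len) toward persistence. */
static int persist(const void *addr, size_t len)
{
    uintptr_t end = (uintptr_t)addr + len;
#if defined(__CLWB__)
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    for (; p < end; p += CACHE_LINE)
        _mm_clwb((void *)p);  /* write back one cache line */
    _mm_sfence();             /* order the write-backs before later stores */
    return 0;
#else
    /* Fallback for builds without CLWB: page-granular msync(). */
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)addr & ~(page - 1);
    return msync((void *)start, end - start, MS_SYNC);
#endif
}

/* Step 1: open the file, preallocate space, and map it. */
int nv_log_open(struct nv_log *lg, const char *path, size_t capacity)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (posix_fallocate(fd, 0, (off_t)capacity) != 0) {
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, capacity, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after close() */
    if (p == MAP_FAILED)
        return -1;
    lg->base = p;
    lg->capacity = capacity;
    lg->tail = 0;
    return 0;
}

/* Step 2: memcpy() instead of write(), cache-line flush instead of fsync(). */
int nv_log_append(struct nv_log *lg, const void *data, size_t len)
{
    assert(lg->tail <= lg->capacity);
    if (len > lg->capacity - lg->tail)
        return -1;  /* out of preallocated space */
    memcpy(lg->base + lg->tail, data, len);
    if (persist(lg->base + lg->tail, len) != 0)
        return -1;
    lg->tail += len;
    return 0;
}
```

The sketch leaves out recovery: the tail offset lives only in DRAM, so a real log would also need a persistent length field or per-record checksums to find the end of valid data after a crash.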
LMDB is another instructive example. It is a database that uses mmap() to access a B-Tree stored in a file. On a conventional storage system, it provides consistency by using msync() to force dirty pages to disk. Calling msync() is slow because 1) it is a system call and 2) it operates on whole pages. If the application changed only a few bytes, writing back an entire page is overkill.
Using an NVMM-enabled file system eliminates both of these problems because the application can implement msync() in user-space with clwb instructions that target only the modified cache lines. Replacing msync() with fine-grain flushing increases LMDB performance by between 11x and 14x, and the changes required to the code are minimal.
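To make the granularity argument concrete, here is a small sketch of what a user-space replacement for msync() could look like. It is not LMDB's code: units_touched(), flush_modified(), and update_in_place() are hypothetical names, and the 64-byte cache line is an assumption. As before, the clwb path is only compiled when the compiler targets a CPU with CLWB and is only meaningful on a DAX mapping; other builds fall back to msync() itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#if defined(__CLWB__)
#include <immintrin.h>
#endif

#define CACHE_LINE 64u  /* assumed cache-line size */

/* Number of unit-sized blocks that overlap [addr, addr+len). */
size_t units_touched(uintptr_t addr, size_t len, size_t unit)
{
    assert(len > 0 && unit > 0);
    return (size_t)((addr + len - 1) / unit - addr / unit + 1);
}

/* Stand-in for msync(): flush only the cache lines that were modified. */
int flush_modified(const void *addr, size_t len)
{
    uintptr_t end = (uintptr_t)addr + len;
#if defined(__CLWB__)
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    for (; p < end; p += CACHE_LINE)
        _mm_clwb((void *)p);  /* no system call, one line at a time */
    _mm_sfence();
    return 0;
#else
    /* Fallback for builds without CLWB: the page-granular system call. */
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)addr & ~(page - 1);
    return msync((void *)start, end - start, MS_SYNC);
#endif
}

/* Change a few bytes inside a mapped region and flush just those bytes. */
int update_in_place(char *map, size_t off, const void *src, size_t len)
{
    memcpy(map + off, src, len);
    return flush_modified(map + off, len);
}
```

With these definitions, an 8-byte update at offset 100 touches units_touched(100, 8, 64) = 1 cache line, so the flush covers 64 bytes, whereas msync() must write back at least one full 4096-byte page for the same change.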
Fixing the Rest of the System
The techniques described above are easy to apply, but it would be even simpler (for the programmer) if the operating system could improve performance “under the hood.” It turns out there is plenty of room for improvement in the operating system’s storage stack, especially when it comes to scalability.
Here are a few examples: concurrent read/write access to shared files scales poorly, as does (surprisingly) creating unshared files in unshared directories. Moving and renaming files and directories is another problem area.
All these problems predate NVMM, but NVMM exacerbates them because it is fast. For IO-intensive applications, fast storage means more IO operations per second and that means more contention for locks. The continued rise in core count makes matters still worse.
When I saw the results for RocksDB, LevelDB, and the other applications we looked at, I had mixed feelings. On one hand, getting good performance gains without much effort is a great deal. On the other, these results seem to reduce the urgency of solving the juicy and interesting research problems presented by building complex pointer-based structures in NVMM. Nevertheless, our results suggest that treating NVMM as memory is still both the path to maximum performance and, in my opinion, the most elegant way to apply these new technologies.